Assignment 4: Text Processing

Overview

The goal of this assignment is to familiarize you with processing strings, sequences, and files through exercises involving looping over the contents of a string, and reading and writing files.

Grading Rubric

Feature                                        Value
Correct one-character count from count_char    4
Correct number of words from num_words         8
Correct number of lines                        2
Statistics correctly calculated/printed        6
Asks for single character to count             4
Works from console                             4
Works from files                               4
Thorough tests for count_char                  4
Thorough tests for num_words                   4
Uses forbidden string methods                  -15
Good use of comments                           4
Good use of variable names                     4
Good use of whitespace                         4
Good use of loops and conditionals             4

Part 0: Pre-lab Planning

Create a week4 folder and open it in VS Code. Download the starter package and copy text_processing.py, text_processing_tester.py, and the examples folder into your week4 folder. Read through the rest of this writeup and think about the following questions:

  1. In num_characters(), when iterating over the characters of a given text, how will you determine whether a character is a non-whitespace character? Try writing out a boolean expression on paper.
  2. In num_words(), how do you identify a word (defined as a consecutive sequence of non-whitespace characters) in a given text? Please explain in words.
  3. In main(), how do you calculate the average word length in a given text? Please explain in words.

Part 1: Simple text processing (In-lab)

In this part, you will write some functions that process text files.

Allowed String Operations

In class and elsewhere you may have been introduced to a number of string operations. In writing the code for this lab you should restrict yourself to the following:

These limitations are to give you practice working with strings as iterables, rather than using Python's extensive built-in string processing utilities, to prepare us for working with collection types in the weeks ahead.

1. every_fourth_char

Define a function every_fourth_char that takes one parameter string of type str and returns a new string constructed from every fourth character of string, starting with the first. For example, every_fourth_char("abcdefghijklmnopqrstuvwxyz") should evaluate to "aeimquy". Write a test case for this function in text_processing_tester.py.

2. copy_parts_of_file

Define a function copy_parts_of_file that takes two parameters (old_filename and new_filename, both of type str) and creates a new file named new_filename in which each line consists of every fourth character of the corresponding line in the file old_filename. For example, if the original file contains the following Shel Silverstein poem:

Oh, if you're a bird, be an early bird
And catch the worm for your breakfast plate.
If you're a bird, be an early bird-
But if you're a worm, sleep late.

The new file should have the contents:

Oioeb, eyr
Acheroobkta
Ioeb, eyr
Bioew,el.

Your implementation should use every_fourth_char as a helper function.

Open the files "examples/earlybird.txt" and "examples/fourth.txt" and look at the contents. Then try running your function with old_filename set to each of these filenames. Remember to set new_filename to a different value to avoid overwriting other files. Note that it is not required that you formally test this function, but you should convince yourself that it is correct.

You should not consider the newline at the end of a line to be a character. You can remove it with the string rstrip ("right strip") method: rather than passing each line to every_fourth_char directly, you might prefer to pass line.rstrip("\n").
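As a small illustration of what rstrip("\n") does, the sketch below compares a line before and after stripping. The sample line is the first line of the poem above; the variable names are only for illustration.

```python
# A line as it would come out of a file: the text plus a trailing newline.
line = "Oh, if you're a bird, be an early bird\n"

# rstrip("\n") removes newline characters from the right end only.
stripped = line.rstrip("\n")

print(len(line))      # 39: 38 visible characters plus the newline
print(len(stripped))  # 38

# Other trailing whitespace is left alone, because only "\n" was requested.
print(repr("af \n".rstrip("\n")))  # 'af '
```

Note that calling rstrip() with no argument would also remove trailing spaces and tabs, which changes the result for lines that end in a space.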

Note: Open the file for reading using the Latin-1 character set (this is an option to the open function). For example:

f = open(file_name,'r',encoding='latin-1')
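For reference, here is a minimal sketch of one common pattern for reading one file and writing another with this encoding. The function copy_file_lines and the file names are hypothetical and are not part of the starter package; the function copies every line unchanged, so it shows only the file handling, not the assignment's logic. The with statement is one option for making sure both files are closed.

```python
import os
import tempfile


def copy_file_lines(old_filename, new_filename):
    """Hypothetical helper: copy old_filename to new_filename, line by line."""
    with open(old_filename, 'r', encoding='latin-1') as infile, \
         open(new_filename, 'w', encoding='latin-1') as outfile:
        for line in infile:
            # Each line still ends with its newline, so it is written as-is.
            outfile.write(line)


# Demonstration using a temporary folder so no real files are overwritten.
folder = tempfile.mkdtemp()
source_name = os.path.join(folder, "source.txt")
copy_name = os.path.join(folder, "copy.txt")

with open(source_name, 'w', encoding='latin-1') as f:
    f.write("first line\nsecond line\n")

copy_file_lines(source_name, copy_name)

with open(copy_name, 'r', encoding='latin-1') as f:
    copied = f.read()

print(copied)
```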

3. num_characters(string)

This function takes a single parameter string (type str). It returns the number of non-whitespace characters in string.

For example:

   num_characters("Happy haPpy     day!   !")

should evaluate to 15 (type int).

Hint: Remember that the string module in Python provides pre-defined strings with names such as whitespace and punctuation and ascii_lowercase. This means, for example, that the following code will not print anything.

from string import ascii_lowercase

# ascii_lowercase holds only 'a' through 'z', so an uppercase 'K' never matches.
for chara in ascii_lowercase:
    if chara == 'K':
        print("found K")
        break

You might try implementing an is_whitespace(char) helper function.
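To see what these pre-defined strings contain, you could run a short exploration like the sketch below. It only inspects the constants from Python's string module and leaves the design of any helper function to you.

```python
import string

# Each constant is an ordinary str, so it can be looped over like any string.
for ch in string.whitespace:
    # repr() makes invisible characters such as tab and newline visible.
    print(repr(ch))

print(len(string.whitespace))   # 6
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
```

The six whitespace characters are the space, tab, newline, carriage return, vertical tab, and form feed.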

Once you believe your function is correct, you must define test cases for this function in the file text_processing_tester.py to thoroughly test this function.

4. count_characters(filename)

This function takes a single parameter filename (type str). It returns the number of non-whitespace characters in that file.

Try running this function with filename set to examples/alice.txt to count the number of non-whitespace characters in Lewis Carroll's Alice in Wonderland. It is not necessary to formally test this function, but you should convince yourself that it is correct. My program found 121,981 non-whitespace characters in examples/alice.txt.

5. Looking forward to part 2

For Part 2 you will do something a little bit trickier than num_characters. Rather than being asked to count all of the non-whitespace characters in a string, you will be asked to count the instances of a particular character, and to do so independently of whether it appears in upper or lower case.

Consider the following poem by E. E. Cummings.

l(a

le
af 
fa

ll 

s) 
one
l

iness

If the user asks to count the character "a", how many instances do you see? How would you modify the num_characters algorithm to do this?

Part 2: More text processing

For this assignment you will implement something similar to wc, a Unix command that "displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output."

In particular, you will implement two additional functions with the names and parameters described below (with the parameters in the order described). Note that the implementations of these functions must loop character by character through the input string. You may not, for example, use the find or count string methods.

In addition, the main() function is a third function that must do exactly what is described. And there are other requirements as well, so please read and double-check your work carefully!

Incremental development and thinking through the order in which you will implement and test things will be very important. Here is one potential order:

count_char(string, char)

This function takes two parameters. Both have type str. The second argument is additionally guaranteed to be only a single character (guaranteeing that this is true is the responsibility of whoever calls this function).

This function returns the number of times the character char appears in string. The function should ignore capitalization. For example:

   count_char("Happy happy haPPY", "y")
   count_char("HAPPY HAPPY HAPPY", "y")
   count_char("HAPPY HAPPY HAPPY", "Y")

should all return 3 (type int).

Once you believe your implementation is correct, add test cases to the test file text_processing_tester.py to thoroughly test your function.
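As one way to organize such tests, the sketch below keeps the cases in a table and checks them with a small helper. The names run_cases, count_char_cases, and reference_count_char are hypothetical, and the layout of the provided text_processing_tester.py may differ. The reference function deliberately uses string methods that this assignment forbids, so it exists only to make the sketch runnable; substitute your own count_char.

```python
def run_cases(func, cases):
    """Return a list of (args, expected, actual) for every failing case."""
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# The first three cases come from the examples in this writeup; the rest
# are edge cases chosen for illustration.
count_char_cases = [
    (("Happy happy haPPY", "y"), 3),
    (("HAPPY HAPPY HAPPY", "y"), 3),
    (("HAPPY HAPPY HAPPY", "Y"), 3),
    (("", "y"), 0),            # empty string
    (("xyz", "a"), 0),         # character never appears
    (("yYyY", "Y"), 4),        # every character matches, mixed case
]


def reference_count_char(string, char):
    """Stand-in for demonstration only: uses forbidden methods (lower, count)."""
    return string.lower().count(char.lower())


failures = run_cases(reference_count_char, count_char_cases)
print(failures)  # an empty list means every case passed
```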

num_words(string)

This function takes a single parameter (type str). It returns the number of words in string (return type is int). In this case a word is defined as a consecutive sequence of non-whitespace characters.

   num_words("Happy haPpy day! !")
   num_words(" Happy haPpy day! ")

should evaluate to 4 and to 3 respectively.

Once you believe your implementation is correct, add test cases to the test file text_processing_tester.py to thoroughly test your function.

main()

To put this all together, your program's main function should:

Note: Since main functions don't have parameters (and return None) we can't use our test file to thoroughly test our main function. But you should still convince yourself that your function is correct.

Going above and beyond

Some suggestions if you want to do more:

Note that you should not change the functionality of the named functions above (since we'll be testing them to make sure they meet the specifications as described).

Instead, you should add new functions with different names that do different things and then describe these functions in that multiline comment at the top of your file submission. I would also recommend having a main2() function which, when executed, uses your new and improved functions.

Incremental development and testing will be critical! Note the suggested implementation order above.

Sample run

single letter to count:
    asdf
you must enter a single letter!
single letter to count:
    2345
you must enter a single letter!
single letter to count:
    !
you must enter a single letter!
single letter to count:
    e

enter 1 for file or 0 for interactive
    0
input line or -1 to stop:
    A is for apple
input line or -1 to stop:
    B is for banana
input line or -1 to stop:
    C is for canteloupe
input line or -1 to stop:
    -1

* statistics *

3 lines
12 words
38 non-whitespace characters
3 e's

average word length is: 3.1666666666666665
percentage e's is: 7.894736842105263

And an example of running in file mode:

single letter to count:
    p
enter 1 for file or 0 for interactive
    1
filename?:
    input1.txt

* statistics *

7 lines
66 words
386 non-whitespace characters
12 p's

average word length is: 5.848484848484849
percentage p's is: 3.1088082901554404
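The two derived statistics in this run appear consistent with simple ratios of the counts printed above them. The sketch below reproduces that arithmetic from the file-mode numbers; the variable names are hypothetical, and the formulas are inferred from the sample output rather than stated by the assignment.

```python
# Counts copied from the file-mode sample run above.
word_count = 66
char_count = 386      # non-whitespace characters
letter_count = 12     # occurrences of the chosen letter, 'p'

# Average word length: non-whitespace characters per word.
average_word_length = char_count / word_count

# Percentage of the non-whitespace characters that are the chosen letter.
percentage = 100 * letter_count / char_count

print(average_word_length)
print(percentage)
```

Both divisions fail if the input is empty (zero words or zero characters), so a complete program may want to handle that case separately.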

Coding Style

Make sure that your program is properly commented and uses appropriate Python style:

Submission

For this lab you are required to submit the following files:

Note that we will deduct points if your files are incorrectly named, if you do not include your names in the correct place, or if you do not include both files in your last submission. Please double and triple check this before submitting!