Assignment 4: Text Processing

Overview

The goal of this assignment is to familiarize you with processing strings, sequences, and files through exercises involving looping over the contents of a string, and reading and writing files.

Grading Rubric

Feature	Value
Correct one-character count from `count_char`	4
Correct number of words from `num_words`	8
Correct number of lines	2
Statistics correctly calculated/printed	6
Asks for single character to count	4
Works from console	4
Works from files	4
Thorough tests for `count_char`	4
Thorough tests for `num_words`	4
Uses forbidden string methods	-15
Good use of comments	4
Good use of variable names	4
Good use of whitespace	4
Good use of loops and conditionals	4

Part 0: Pre-lab Planning

Create a week4 folder and open it in VS Code. Download the starter package and copy text_processing.py, text_processing_tester.py, and the examples folder into your week4 folder. Read through the rest of this writeup and think about the following questions:

In num_characters(), when iterating over the characters of a given text, how will you determine whether a character is a non-whitespace character? Try writing out a boolean expression on paper.
In num_words(), how do you identify a word (defined as a consecutive sequence of non-whitespace characters) in a given text? Please explain in words.
In main(), how do you calculate the average word length in a given text? Please explain in words.

Part 1: Simple text processing (In-lab)

In this part, you will write some functions that process text files.

Allowed String Operations

In class and elsewhere you may have been introduced to a number of string operations. In writing the code for this lab you should restrict yourself to the following:

len
Indexing as in s[3] (but not general "slicing" with colons)
lower(), upper()
in used as a for loop iterator (e.g. for chara in string)
rstrip() as discussed below

These limitations are to give you practice working with strings as iterables, rather than using Python's extensive built-in string processing utilities, to prepare us for working with collection types in the weeks ahead.

1. `every_fourth_char`

Define a function every_fourth_char that takes one parameter string of type str and returns a new string constructed from every fourth character of string, starting with the first. For example, every_fourth_char("abcdefghijklmnopqrstuvwxyz") should evaluate to "aeimquy". Write a test case for this function in text_processing_tester.py.
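As a rough sketch only (not the required solution), the idea can be expressed with nothing more than len, indexing, and a loop. The plain assert at the bottom is an assumption about how a check might look; the style expected by text_processing_tester.py may differ.

```python
def every_fourth_char(string):
    """Return the characters of string at indices 0, 4, 8, ..."""
    result = ""
    # range with a step of 4 visits indices 0, 4, 8, ... below len(string)
    for index in range(0, len(string), 4):
        result = result + string[index]
    return result


# A plain-assert check; the real tester file may use a different style.
assert every_fourth_char("abcdefghijklmnopqrstuvwxyz") == "aeimquy"
```

Building the result by indexing avoids slicing with colons, which is not on the allowed list.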

2. `copy_parts_of_file`

Define a function copy_parts_of_file that takes two parameters (old_filename and new_filename, both of type str), and creates a new file named new_filename in which each line consists of every fourth character of the corresponding line in the file old_filename. For example, if the original file contains the following Shel Silverstein poem:

Oh, if you're a bird, be an early bird
And catch the worm for your breakfast plate.
If you're a bird, be an early bird-
But if you're a worm, sleep late.

The new file should have the contents:

Oioeb, eyr
Acheroobkta
Ioeb, eyr
Bioew,el.

Your implementation should use every_fourth_char as a helper function.

Open the files "examples/earlybird.txt" and "examples/fourth.txt" and look at the contents. Then try running your function with old_filename set to each of these filenames. Remember to set new_filename to a different value to avoid overwriting other files. Note that it is not required that you formally test this function, but you should convince yourself that it is correct.

You should not consider the newline at the end of a line to be a character. You can remove it with the string rstrip ("right strip") method. Rather than passing each line to every_fourth_char directly, you might prefer to pass line.rstrip("\n").

Note: Open the file for reading using the Latin-1 character set (this is the encoding option of the open function). For example:

f = open(file_name,'r',encoding='latin-1')
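To see how the open call and rstrip fit together, here is a minimal, self-contained sketch. The file name sample.txt and its two-line contents are made up for illustration; the sketch writes the file to a temporary folder so that nothing in your week4 folder is touched.

```python
import os
import tempfile

total_length = 0  # running count of characters, newlines excluded

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical throwaway file so the sketch can run anywhere.
    file_name = os.path.join(folder, "sample.txt")
    with open(file_name, 'w', encoding='latin-1') as f:
        f.write("first line\nsecond line\n")

    # Read the file back one line at a time.
    f = open(file_name, 'r', encoding='latin-1')
    for line in f:
        stripped = line.rstrip("\n")  # drop only the trailing newline
        total_length = total_length + len(stripped)
    f.close()

print(total_length)  # prints 21 (10 + 11 characters)
```

Passing "\n" to rstrip removes only the newline, so trailing spaces that are part of a line are kept.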

3. `num_characters(string)`

This function takes a single parameter string (type str). It returns the number of non-whitespace characters in string.

For example:

   num_characters("Happy haPpy     day!   !")

should evaluate to 15 (type int).

Hint: Remember that the string module in Python provides pre-defined strings with names such as whitespace and punctuation and ascii_lowercase. This means, for example, that the following code will not print anything.

from string import *
for chara in ascii_lowercase:
    if chara == 'K':
        print("found K")
        break

You might try implementing an is_whitespace(char) helper function.
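One possible shape for such a helper is sketched below. It assumes that "whitespace" means exactly the characters in string.whitespace, and it compares characters one by one in a for loop, since that use of in is on the allowed list.

```python
from string import whitespace


def is_whitespace(char):
    """Return True if char is one of the characters in string.whitespace."""
    for candidate in whitespace:
        if char == candidate:
            return True
    return False
```

For example, is_whitespace(" ") and is_whitespace("\n") are True, while is_whitespace("K") is False.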

Once you believe your function is correct, you must define test cases for this function in the file text_processing_tester.py to thoroughly test this function.

4. `count_characters(filename)`

This function takes a single parameter filename (type str). It returns the number of non-whitespace characters in that file.

Try running this function with filename set to "examples/alice.txt" to count the number of non-whitespace characters in Lewis Carroll's Alice in Wonderland. It is not necessary to formally test this function, but you should convince yourself that it is correct. My program found 121,981 non-whitespace characters in examples/alice.txt.

5. Looking forward to part 2

For Part 2 you will do something a little bit trickier than num_characters. Rather than being asked to count all of the non-whitespace characters in a string, you will be asked to count the instances of a particular character, and to do so independently of whether it appears in upper or lower case.

To prepare, consider the following poem by E. E. Cummings.

l(a

le
af 
fa

ll 

s) 
one
l

iness

If the user asks to count the character "a", how many instances do you see? How would you modify the num_characters algorithm to do this?

Part 2: More text processing

For this assignment you will implement something similar to wc, a Unix command that "displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output."

In particular, you will implement two additional functions with the names and the parameters described below (with the parameters in the order described). Note that the implementations of all three functions (num_characters, count_char, and num_words) must loop character by character through the input string. You may not, for example, use the find or count string methods.

In addition, the main() function is a third function that must do exactly what is described. And there are other requirements as well, so please read and double-check your work carefully!

Incremental development and thinking through the order in which you will implement and test things will be very important. Here is one potential order:

Implement and test num_characters (you need to do this before leaving lab)
Implement and test count_char
Implement and test num_words
Add code that asks the user for a letter and checks to ensure that they only provide one letter (keep asking if they don't).
Test to make sure that works and that it interacts correctly with the other three functions.
Add code that asks the user to enter lines repeatedly until they enter a -1. Test that your code correctly counts the number of lines, words, characters.
Add code for taking input from a file. Test it.
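The letter-checking step in the list above could be sketched as follows. The helper names is_single_letter and ask_for_letter are hypothetical (they are not part of the starter package), the check assumes that "letter" means an ASCII letter, and the prompt strings are copied from the sample run.

```python
from string import ascii_lowercase


def is_single_letter(text):
    """Return True if text is exactly one ASCII letter (either case)."""
    if len(text) != 1:
        return False
    lowered = text.lower()
    for letter in ascii_lowercase:
        if lowered == letter:
            return True
    return False


def ask_for_letter():
    """Keep prompting until the user types a single letter."""
    answer = input("single letter to count: ")
    while not is_single_letter(answer):
        print("you must enter a single letter!")
        answer = input("single letter to count: ")
    return answer
```

Keeping the check in its own function means it can be tested without typing anything at the console.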

`count_char(string, char)`

This function takes two parameters. Both have type str. The second argument is additionally guaranteed to be only a single character (guaranteeing that this is true is the responsibility of whoever calls this function).

This function returns the number of times the character char appears in string. The function should ignore capitalization. For example:

   count_char("Happy happy haPPY", "y")
   count_char("HAPPY HAPPY HAPPY", "y")
   count_char("HAPPY HAPPY HAPPY", "Y")

should all return 3 (type int).
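A minimal sketch of one way to ignore capitalization is to lowercase both sides before comparing. This is an illustration under the allowed-operations rules, not the only acceptable approach.

```python
def count_char(string, char):
    """Count occurrences of char in string, ignoring case."""
    target = char.lower()
    count = 0
    for chara in string:
        # Compare lowercase to lowercase so "P" and "p" match.
        if chara.lower() == target:
            count = count + 1
    return count
```

Lowercasing the target once, before the loop, avoids repeating the same work for every character.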

Once you believe your implementation is correct, add test cases to the test file text_processing_tester.py to thoroughly test your function.

`num_words(string)`

This function takes a single parameter (type str). It returns the number of words in string (return type is int). In this case a word is defined as a consecutive sequence of non-whitespace characters.

   num_words("Happy haPpy day! !")
   num_words(" Happy haPpy day! ")

should evaluate to 4 and to 3 respectively.
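One way to think about this is to count the places where a word starts, that is, where a non-whitespace character follows whitespace or the beginning of the string. The sketch below assumes a helper is_whitespace built on string.whitespace; both the helper and the in_word flag are illustrative choices rather than requirements.

```python
from string import whitespace


def is_whitespace(char):
    """Return True if char is one of the characters in string.whitespace."""
    for candidate in whitespace:
        if char == candidate:
            return True
    return False


def num_words(string):
    """Count maximal runs of consecutive non-whitespace characters."""
    count = 0
    in_word = False  # True while the loop is inside a word
    for chara in string:
        if is_whitespace(chara):
            in_word = False
        elif not in_word:
            # First character of a new word.
            in_word = True
            count = count + 1
    return count
```

Because only word starts are counted, leading, trailing, and repeated whitespace do not affect the result.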

Once you believe your implementation is correct, add test cases to the test file text_processing_tester.py to thoroughly test your function.

`main()`

To put this all together, your program's main function should:

Ask the user for a character char to count. If the user does not enter a single letter, ask them again. Continue until they enter a single letter.
Ask the user if they want to run in file or interactive mode by asking them to enter a 1 for file mode and a 0 for interactive mode.
If the user is running in interactive mode: ask the user to enter a line of text or a -1 if they are done. Continue until the user enters a -1.
If the user is running in file mode: ask the user for a filename.
Print the following set of statistics either about the lines of text entered by the user in interactive mode, or about the file if in file mode:
- The total number of lines
- The total number of words
- The total number of non-whitespace characters
- The number of times char appears, ignoring case
- The average length of a word (number of non-whitespace characters divided by the number of words)
- The percentage of char (number of times char appears divided by the number of non-whitespace characters, times 100).
- Note that the output should be in the format shown in the sample runs below.
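The two derived statistics are simple ratios of the counts. The sketch below uses hypothetical helper names and parameter names, and it returns 0.0 when the denominator is zero, which is an assumption; the writeup does not say what to print for empty input.

```python
def average_word_length(total_chars, total_words):
    """Non-whitespace characters divided by words."""
    if total_words == 0:
        return 0.0  # assumption: avoid dividing by zero on empty input
    return total_chars / total_words


def percentage_of_char(char_count, total_chars):
    """Occurrences of char divided by non-whitespace characters, times 100."""
    if total_chars == 0:
        return 0.0  # assumption: avoid dividing by zero on empty input
    return char_count / total_chars * 100


# Counts taken from the interactive sample run: 38 characters, 12 words, 3 e's.
print("average word length is:", average_word_length(38, 12))
print("percentage e's is:", percentage_of_char(3, 38))
```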

Note: Since main functions don't have parameters (and return None) we can't use our test file to thoroughly test our main function. But you should still convince yourself that your function is correct.

Going above and beyond

Some suggestions if you want to do more:

The description above treats punctuation as letters (e.g. they are included in words and character counts), when it might be better to ignore them. Write new functions that ignore punctuation.
Currently the average word length and percentage of a letter contain much more precision than you really need. Use formatted string literals (f-strings), described in the Python documentation, to format those numbers a little more nicely.
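As a small illustration of the second suggestion, a format specifier such as .2f inside an f-string rounds a float to two decimal places for display. The variable names here are made up, and the value reuses the counts from the interactive sample run.

```python
average = 38 / 12  # counts from the interactive sample run

# :.2f keeps two digits after the decimal point.
formatted = f"average word length is: {average:.2f}"
print(formatted)  # prints: average word length is: 3.17
```

The stored value of average is unchanged; only the printed text is rounded.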

Note that you should not change the functionality of the named functions above (since we'll be testing them to make sure they meet the specifications as described).

Instead, you should add new functions with different names that do different things and then describe these functions in the multiline comment at the top of your file submission. I would also recommend having a main2() function which, when executed, uses your new and improved functions.

Incremental development and testing will be critical! Note the suggested implementation order above.

Sample run

single letter to count:
    asdf
you must enter a single letter!
single letter to count:
    2345
you must enter a single letter!
single letter to count:
    !
you must enter a single letter!
single letter to count:
    e
enter 1 for file or 0 for interactive
    0
input line or -1 to stop:
    A is for apple
input line or -1 to stop:
    B is for banana
input line or -1 to stop:
    C is for canteloupe
input line or -1 to stop:
    -1

* statistics *

3 lines
12 words
38 non-whitespace characters
3 e's

average word length is: 3.1666666666666665
percentage e's is: 7.894736842105263

And an example of running in file mode:

single letter to count:
    p
enter 1 for file or 0 for interactive
    1
filename?:
    input1.txt

* statistics *

7 lines
66 words
386 non-whitespace characters
12 p's

average word length is: 5.848484848484849
percentage p's is: 3.1088082901554404

Coding Style

Make sure that your program is properly commented and uses appropriate Python style:

You should include meaningful one-line comments that explain what your code is doing.
Follow naming conventions, e.g. all variables and functions should be lowercase with underscores separating words.
Use good (mnemonic) variable names.
Use whitespace well, including indenting and use of blank lines to separate chunks of code that belong together.
Good use of loops.
Good use of conditionals (like last week).

Submission

For this lab you are required to submit the following files:

text_processing.py: a Python file that contains the implementation of all the required functions as specified.
text_processing_tester.py: a Python file that contains test cases for your functions.

Note that we will deduct points if your files are incorrectly named, if you do not include your names in the correct place, or if you do not include both files in your last submission. Please double and triple check this before submitting!