The goal of this assignment is to familiarize you with processing strings, sequences, and files through exercises involving looping over the contents of a string, and reading and writing files.
Feature | Value |
---|---|
Correct one-character count from count_char | 4 |
Correct number of words from num_words | 8 |
Correct number of lines | 2 |
Statistics correctly calculated/printed | 6 |
Asks for single character to count | 4 |
Works from console | 4 |
Works from files | 4 |
Thorough tests for count_char | 4 |
Thorough tests for num_words | 4 |
Uses forbidden string methods | -15 |
Good use of comments | 4 |
Good use of variable names | 4 |
Good use of whitespace | 4 |
Good use of loops and conditionals | 4 |
Create a `week4` folder and open it in VS Code. Download the starter package and copy `text_processing.py`, `text_processing_tester.py`, and the `examples` folder into your `week4` folder. Read through the rest of this writeup and think about the following questions:

- For `num_characters()`, when iterating over the characters of a given text, how will you determine whether or not a character is a non-whitespace character? Try writing out a boolean expression on paper.
- For `num_words()`, how do you identify a word (defined as a consecutive sequence of non-whitespace characters) in a given text? Please explain in words.
- For `main()`, how do you calculate the average word length in a given text? Please explain in words.

In this part, you will write some functions that process text files.
In class and elsewhere you may have been introduced to a number of string operations. In writing the code for this lab you should restrict yourself to the following:

- `len`
- `s[3]` (but not general "slicing" with colons)
- `lower()`, `upper()`
- `in` used as a `for` loop iterator (e.g. `for chara in string`)
- `rstrip()` as discussed below

These limitations are to give you practice working with strings as iterables, rather than using Python's extensive built-in string processing utilities, to prepare us for working with collection types in the weeks ahead.

every_fourth_char
Define a function `every_fourth_char` that takes one parameter `string` of type `str` and returns a new string constructed from every fourth character, starting with the first. For example, `every_fourth_char("abcdefghijklmnopqrstuvwxyz")` should evaluate to `"aeimquy"`. Write a test case for this function in `text_processing_tester.py`.
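
The behavior described above can be sketched as follows. This is one possible approach, not the required one: it assumes that stepping through indices with `range` and indexing with `string[i]` fits within the allowed operations (no slicing is used). The plain `assert` checks are only illustrative; the tester file may use a different testing helper.

```python
def every_fourth_char(string):
    """Return a new string made of the characters at indices 0, 4, 8, ..."""
    result = ""
    # range(start, stop, step) visits every fourth index of the string.
    for i in range(0, len(string), 4):
        result += string[i]
    return result


# Illustrative checks, including the example from the writeup.
assert every_fourth_char("abcdefghijklmnopqrstuvwxyz") == "aeimquy"
assert every_fourth_char("") == ""
```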
copy_parts_of_file
Define a function `copy_parts_of_file` that takes two parameters (`old_filename` and `new_filename`, both of type `str`) and creates a new file named `new_filename` in which each line consists of every fourth character of the corresponding line in the file `old_filename`. For example, if the original file contains the following Shel Silverstein poem:

    Oh, if you're a bird, be an early bird
    And catch the worm for your breakfast plate.
    If you're a bird, be an early bird-
    But if you're a worm, sleep late.

The new file should have the contents:

    Oioeb, eyr
    Acheroobkta
    Ioeb, eyr
    Bioew,el.

Your implementation should use `every_fourth_char` as a helper function.

Open the files `"examples/earlybird.txt"` and `"examples/fourth.txt"` and look at the contents. Then try running your function with `old_filename` set to each of these filenames. Remember to set `new_filename` to a different value to avoid overwriting other files. Note that it is not required that you formally test this function, but you should convince yourself that it is correct.

You should not consider the newline at the end of a line to be a character. You can eliminate it with the string `rstrip` ("right strip") method: rather than passing each line directly to `every_fourth_char`, you might prefer to pass `line.rstrip("\n")`.

Note: Open the file for reading using the Latin-1 character set (this is an option in the file `open` command). For example:

    f = open(file_name, 'r', encoding='latin-1')
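
Putting the pieces above together, a minimal sketch of the file-copying pattern might look like this. It assumes one output line per input line and that the output file is also written as Latin-1 (the writeup only specifies the encoding for reading). The compact `every_fourth_char` is repeated here only so the block runs on its own.

```python
def every_fourth_char(string):
    """Return the characters of string at indices 0, 4, 8, ..."""
    result = ""
    for i in range(0, len(string), 4):
        result += string[i]
    return result


def copy_parts_of_file(old_filename, new_filename):
    """Write every fourth character of each line of old_filename to new_filename."""
    with open(old_filename, 'r', encoding='latin-1') as infile:
        # Assumption: the output file uses the same encoding as the input.
        with open(new_filename, 'w', encoding='latin-1') as outfile:
            for line in infile:
                # Drop the trailing newline so it is not counted as a character,
                # then add one back so the output keeps one line per input line.
                outfile.write(every_fourth_char(line.rstrip("\n")) + "\n")
```

Using `with` closes both files automatically, even if an error occurs partway through.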
num_characters(string)
This function takes a single parameter `string` (type `str`). It returns the number of non-whitespace characters in `string`. For example, `num_characters("Happy haPpy day! !")` should evaluate to 15 (type `int`).

Hint: Remember that the `string` module in Python provides pre-defined strings with names such as `whitespace`, `punctuation`, and `ascii_lowercase`. This means, for example, that the following code will not print anything, because `ascii_lowercase` contains only lowercase letters:

    from string import ascii_lowercase

    # Compare each lowercase letter against an uppercase 'K'; none will match.
    for chara in ascii_lowercase:
        if chara == 'K':
            print("found K")
            break

You might try implementing an `is_whitespace(char)` helper function.

Once you believe your function is correct, you must define test cases for this function in the file `text_processing_tester.py` to thoroughly test this function.
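
As a rough sketch of the hint above, the helper and the counting loop could be written as follows. This is one possible shape, assuming that looping over `string.whitespace` and comparing with `==` is the intended way to test a character (mirroring the `ascii_lowercase` example), so that `in` is used only as a `for` loop iterator.

```python
from string import whitespace


def is_whitespace(char):
    """Return True if char matches any character in string.whitespace."""
    for candidate in whitespace:
        if char == candidate:
            return True
    return False


def num_characters(string):
    """Count the non-whitespace characters in string."""
    count = 0
    for chara in string:
        if not is_whitespace(chara):
            count += 1
    return count


assert num_characters("Happy haPpy day! !") == 15
```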
count_characters(filename)
This function takes a single parameter `filename` (type `str`). It returns the number of non-whitespace characters in that file.

Try running this function with `filename` set to `examples/alice.txt` to count the number of non-whitespace characters in Lewis Carroll's Alice in Wonderland. It is not necessary to formally test this function, but you should convince yourself that it is correct. My program found 121,981 non-whitespace characters in `examples/alice.txt`.

For Part 2 you will do something a little bit trickier than `num_characters`. Rather than being asked to count all of the non-whitespace characters in a string, you will be asked to count the instances of a particular character, and to do so independently of whether it appears in upper or lower case.

Consider the following poem by E. E. Cummings.

    l(a
    le
    af
    fa
    ll
    s)
    one
    l
    iness

If the user asks to count the character "a", how many instances do you see? How would you modify the `num_characters` algorithm to do this?

For this assignment you will implement something similar to `wc`, a Unix command that "displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output."

In particular, you will implement 2 additional functions with the names and the parameters described below (with the parameters in the order described). Note that the implementation of each of these functions must loop character by character through the input string. You may not, for example, use the `find` or `count` string methods.

In addition, the `main()` function is a third function that must do exactly what is described. And there are other requirements as well, so please read and double-check your work carefully!

Incremental development and thinking through the order in which you will implement and test things will be very important. Here is one potential order:

1. `num_characters` (you need to do this before leaving lab)
2. `count_char`
3. `num_words`

count_char(string, char)
This function takes two parameters, both of type `str`. The second argument is additionally guaranteed to be only a single character (guaranteeing that this is true is the responsibility of whoever calls this function).

This function returns the number of times the character `char` appears in `string`. The function should ignore capitalization. For example:

    count_char("Happy happy haPPY", "y")
    count_char("HAPPY HAPPY HAPPY", "y")
    count_char("HAPPY HAPPY HAPPY", "Y")

should all return 3 (type `int`).

Once you believe your implementation is correct, add test cases to the test file `text_processing_tester.py` to thoroughly test your function.
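
One way to sketch the case-insensitive count described above is to lowercase both sides of each comparison. This is an illustrative approach under the lab's restrictions (character-by-character loop, `lower()` allowed, no `count` or `find`), not necessarily the intended solution.

```python
def count_char(string, char):
    """Count occurrences of the single character char in string, ignoring case."""
    target = char.lower()
    count = 0
    for chara in string:
        # Lowercasing each character makes 'Y' and 'y' compare equal.
        if chara.lower() == target:
            count += 1
    return count


assert count_char("Happy happy haPPY", "y") == 3
assert count_char("HAPPY HAPPY HAPPY", "y") == 3
assert count_char("HAPPY HAPPY HAPPY", "Y") == 3
```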
num_words(string)
This function takes a single parameter (type `str`). It returns the number of words in `string` (return type is `int`). In this case a word is defined as a consecutive sequence of non-whitespace characters. For example:

    num_words("Happy haPpy day! !")
    num_words(" Happy haPpy day! ")

should evaluate to 4 and to 3 respectively.

Once you believe your implementation is correct, add test cases to the test file `text_processing_tester.py` to thoroughly test your function.
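
Because a word is a run of consecutive non-whitespace characters, one possible approach is to count the places where a word starts, tracking whether the loop is currently inside a word. The sketch below assumes a hypothetical `is_whitespace` helper that loops over `string.whitespace`; the flag name `in_word` is also just illustrative.

```python
from string import whitespace


def is_whitespace(char):
    """Return True if char matches any character in string.whitespace."""
    for candidate in whitespace:
        if char == candidate:
            return True
    return False


def num_words(string):
    """Count runs of consecutive non-whitespace characters in string."""
    count = 0
    in_word = False  # True while the loop is inside a word
    for chara in string:
        if is_whitespace(chara):
            in_word = False
        elif not in_word:
            # First non-whitespace character after whitespace (or at the
            # start of the string) begins a new word.
            in_word = True
            count += 1
    return count


assert num_words("Happy haPpy day! !") == 4
assert num_words(" Happy haPpy day! ") == 3
```

Counting word starts rather than spaces means leading, trailing, and repeated whitespace need no special handling.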
main()
To put this all together, your program's `main` function should:

- Ask the user for a single letter `char` to count. If the user does not enter a single letter, ask them again. Continue until they enter a single letter.
- Print the number of times `char` appears, ignoring case.
- Print the percentage of `char` (number of times `char` appears divided by the number of non-whitespace characters, times 100).

Note: Since main functions don't have parameters (and return `None`) we can't use our test file to thoroughly test our `main` function. But you should still convince yourself that your function is correct.
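
The "ask again until valid" step above can be sketched as a loop around `input`. The helper names `is_single_letter` and `ask_for_letter` are hypothetical, as is the `read_line` parameter, which exists only so the loop can be exercised without a keyboard. The prompt and error text are copied from the sample runs; treating "letter" as an ASCII letter is an assumption.

```python
from string import ascii_letters


def is_single_letter(text):
    """Return True if text is exactly one ASCII letter."""
    if len(text) != 1:
        return False
    for letter in ascii_letters:
        if text == letter:
            return True
    return False


def ask_for_letter(read_line=input):
    """Keep prompting until the user enters a single letter, then return it."""
    answer = read_line("single letter to count: ")
    while not is_single_letter(answer):
        print("you must enter a single letter!")
        answer = read_line("single letter to count: ")
    return answer
```

Passing a stand-in for `input` (for example a function that returns canned answers) is one way to convince yourself the loop behaves correctly even though `main` itself cannot be tested from the tester file.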
Some suggestions if you want to do more:

- Format the numbers your program prints a little more nicely.

Note that you should not change the functionality of the named functions above (since we'll be testing them to make sure they meet the specifications as described). Instead, you should add new functions with different names that do different things and then describe these functions in the multiline comment at the top of your file submission. I would also recommend having a `main2()` function which, when executed, uses your new and improved functions.
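
As an illustration of nicer number formatting, Python's f-strings accept a format specifier such as `.2f` to fix the number of decimal places. The values below are taken from the interactive sample run; the choice of two and one decimal places is arbitrary.

```python
average_word_length = 3.1666666666666665
percentage = 7.894736842105263

# ".2f" rounds to two decimal places; ".1f" rounds to one.
print(f"average word length is: {average_word_length:.2f}")  # prints 3.17
print(f"percentage e's is: {percentage:.1f}%")               # prints 7.9%
```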
Incremental development and testing will be critical! Note the suggested implementation order above.
An example of running in interactive mode:

    single letter to count: asdf
    you must enter a single letter!
    single letter to count: 2345
    you must enter a single letter!
    single letter to count: !
    you must enter a single letter!
    single letter to count: e
    enter 1 for file or 0 for interactive 0
    input line or -1 to stop: A is for apple
    input line or -1 to stop: B is for banana
    input line or -1 to stop: C is for canteloupe
    input line or -1 to stop: -1
    * statistics *
    3 lines
    12 words
    38 non-whitespace characters
    3 e's
    average word length is: 3.1666666666666665
    percentage e's is: 7.894736842105263

And an example of running in file mode:

    single letter to count: p
    enter 1 for file or 0 for interactive 1
    filename?: input1.txt
    * statistics *
    7 lines
    66 words
    386 non-whitespace characters
    12 p's
    average word length is: 5.848484848484849
    percentage p's is: 3.1088082901554404

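
The statistics in the sample runs are consistent with the arithmetic below: average word length as non-whitespace characters divided by words, and percentage as the character count divided by non-whitespace characters, times 100. The function names are hypothetical, and returning 0.0 for empty input is an assumption the writeup does not specify.

```python
def average_word_length(num_chars, num_words):
    """Non-whitespace characters per word (0.0 if there are no words)."""
    if num_words == 0:
        return 0.0  # assumption: avoid dividing by zero on empty input
    return num_chars / num_words


def percentage_of(count, num_chars):
    """Share of non-whitespace characters equal to the chosen letter, in percent."""
    if num_chars == 0:
        return 0.0  # assumption: avoid dividing by zero on empty input
    return count / num_chars * 100


# Values from the file-mode sample run: 386 characters, 66 words, 12 p's.
print(average_word_length(386, 66))
print(percentage_of(12, 386))
```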
Make sure that your program is properly commented and uses appropriate Python style (see the rubric above: comments, variable names, whitespace, and loops and conditionals).
For this lab you are required to submit the following files:

- `text_processing.py`: a Python file that contains the implementation of all the required functions as specified.
- `text_processing_tester.py`: a Python file that contains test cases for your functions.

Note that we will deduct points if your files are incorrectly named, if you do not include your names in the correct place, or if you do not include both files in your last submission. Please double and triple check this before submitting!