# Text Processing
The goal of this assignment is to familiarize you with processing strings,
sequences, and files through exercises such as looping over the contents of a
string and reading and writing files.
| Part | Section |
|---------------|-----------------------------------------------|
| 1 (in-lab) | [Part 1: simple text processing](#part1) |
| 1 (in-lab) | [Check-in](#checkin) |
| 2 (lab/home) | [Part 2: more text processing](#part2) |
| 2 (lab/home) | [Submission Instructions](#submission) |
## Getting Started
First, decide whether you plan to work with a partner on this
assignment. Working with a partner is strongly encouraged.
Regardless of whether you work alone or in a pair,
you are required to read about pair programming. If you have not already
done so, read the [instructions on pair programming](../../pair_programming.html) before
making your decision.
If you decide to work with a partner, make sure you are
sitting next to each other. You will only need one computer. (But you will
probably want two chairs!)
One person should create a new project named `TextProcessing` in the
`CSCI051p-Workspace` you created on your Desktop. *Double check that you
are creating the project in the right place, or you will likely have
trouble finding your files later.* Then download the
[starter code](tp.zip). You should see a folder named `starter` that
contains two files (`text_processing.py` and `text_processing_tester.py`) and one subfolder (`examples`). Copy the two Python files and the `examples` folder into
the (recently created) `CSCI051p-Workspace/TextProcessing` folder. If you don't see all
the new files, ask PyCharm to rescan that folder by clicking the triangle
next to that folder (on the left-side list) to close and re-open it. The newly
added items (`text_processing.py`, `text_processing_tester.py` and `examples`) should now be visible.
## Part 1: simple text processing
In this part, you will write some functions that process text files.
#### 1. every_fourth_char
Define a function `every_fourth_char` that takes one parameter `string`
of type `str` and returns a new string constructed from every fourth character
of `string`, starting with the first. For
example, `every_fourth_char("abcdefghijklmnopqrstuvwxyz")` should evaluate
to `"aeimquy"`.
Note: it is not required that you formally test this function, but you
should convince yourself that it is correct.
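One way to step through a string at a fixed stride is to loop over indices with `range`. The sketch below uses a hypothetical helper name, `every_nth_char`, which is not part of the starter code; it only illustrates the looping pattern and is not the required implementation.

```python
def every_nth_char(text, step):
    """Return a string built from every step-th character of text,
    starting with the first (hypothetical helper for illustration)."""
    result = ""
    # range(start, stop, step) yields 0, step, 2*step, ... below len(text)
    for index in range(0, len(text), step):
        result = result + text[index]
    return result


print(every_nth_char("abcdefghijklmnopqrstuvwxyz", 4))
```

With a step of 4 this prints the same string as the example above.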
#### 2. copy_parts_of_file
Define a function `copy_parts_of_file` that takes two parameters (`old_filename`
and `new_filename`, both of type `str`), and creates a new file named
`new_filename` in which each line consists of every fourth character of the
corresponding line in the file `old_filename`. For example, if the original
file contains the following Shel Silverstein poem:
```
Oh, if you're a bird, be an early bird
And catch the worm for your breakfast plate.
If you're a bird, be an early bird-
But if you're a worm, sleep late.
```
The new file should have the contents:
```
Oioeb, eyr
Acheroobkta
Ioeb, eyr
Bioew,el.
```
Your implementation should use `every_fourth_char` as a helper function.
Open the files `"examples/earlybird.txt"` and `"examples/fourth.txt"` and
look at the contents. Then try running your function with `old_filename`
set to each of these filenames. Remember to set `new_filename` to a new
value to avoid overwriting other files. Note that it is not required that
you formally test this function, but you should convince yourself that it
is correct.
You should not consider the newline at the end of a line to be a
character. You can remove it with the string `rstrip`
method: rather than passing each line directly to `every_fourth_char`, you
might prefer to pass `line.rstrip("\n")`. Remember that each line you write
to the new file still needs to end with a newline.
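The read, strip, transform, and write pattern described above might look roughly like the following. This is a sketch under assumptions: the function name `copy_uppercase` and the upper-casing step are hypothetical stand-ins, and the temporary files exist only so the example can run without touching the `examples` folder.

```python
import os
import tempfile


def copy_uppercase(old_filename, new_filename):
    """Write an upper-cased copy of old_filename to new_filename
    (hypothetical example; upper-casing is a stand-in transformation)."""
    with open(old_filename, "r") as infile, open(new_filename, "w") as outfile:
        for line in infile:
            # drop only the trailing newline, keeping other whitespace
            text = line.rstrip("\n")
            # put the newline back after transforming the line
            outfile.write(text.upper() + "\n")


# try it on a small temporary file so no existing file is overwritten
folder = tempfile.mkdtemp()
source = os.path.join(folder, "source.txt")
target = os.path.join(folder, "target.txt")
with open(source, "w") as f:
    f.write("first line \nsecond line\n")
copy_uppercase(source, target)
with open(target) as f:
    print(f.read())
```

Note that `rstrip("\n")` leaves the trailing space in `"first line "` alone, whereas `rstrip()` with no argument would remove it too.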
#### 3. num_characters(string)
This function takes a single parameter `string` (type `str`). It
returns the number of __non-whitespace__ characters in
`string`.
For example:
```
num_characters("Happy haPpy day! !")
```
should evaluate to 15 (type `int`).
*Hint:* Remember that the [`string` module](https://docs.python.org/3/library/string.html) in Python
provides pre-defined strings with names such as `whitespace`,
`punctuation`, and `ascii_lowercase`. This means,
for example, that the following code will print `False`.
```
from string import *
print("K" in ascii_lowercase)
```
Once you believe your function is correct, you must define test cases for
this function in the file `text_processing_tester.py` to thoroughly test
this function.
#### 4. count_characters(filename)
This function takes a single parameter `filename` (type `str`). It
returns the number of __non-whitespace__ characters in that file.
Try running this function with `filename` set to `"examples/alice.txt"` to count the number of
non-whitespace characters in Lewis Carroll's *Alice in Wonderland*. It is
not necessary to formally test this function, but you should convince
yourself that it is correct. My program found 121,981 non-whitespace
characters in `examples/alice.txt`.
#### 5. Looking forward to part 2
For Part 2 you will do something a little bit trickier than
`num_characters`. Rather than being asked to count all of
the non-whitespace characters in a string, you will be asked
to count the instances of a particular character ... and to do
so independently of whether it appears in upper or lower case.
Consider the following poem by E. E. Cummings
(<https://en.wikipedia.org/wiki/L(a>):
```
l(a
le
af
fa
ll
s)
one
l
iness
```
If the user asks to count the character "a", how many instances
do you see? How would you modify the `num_characters` algorithm to
do this?
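One common way to compare two characters without regard to case is to convert both to lower case first. The helper name `same_letter` below is hypothetical and only illustrates the comparison; how you fold it into your counting loop is up to you.

```python
def same_letter(first, second):
    """Return True if the two one-character strings are the same letter,
    ignoring case (hypothetical helper for illustration)."""
    return first.lower() == second.lower()


print(same_letter("A", "a"))
print(same_letter("a", "b"))
```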
#### Checking In
Before finding a TA or professor, make sure your four functions work
correctly and are written using good coding style. In particular, make
sure your functions have:
- appropriate docstrings
- good algorithm comments
- mnemonic variable names
- good use of horizontal and vertical white space
We will double check your code, ask you a few questions about it, and
answer any questions you have. We will then ask you for your answers to the
questions in "Looking forward to part 2" and answer any questions you might
have about Part 2. We will then award your points for Part 1.
This must be completed before leaving the lab.
After that you should start working on Part 2.
## Part 2: more text processing
For this assignment you will implement something similar to
`wc`, a Unix command that "displays the number of lines,
words, and bytes contained in each input file, or standard input (if
no file is specified) to the standard output."
In particular, you will implement two additional functions with the
names and the parameters described below (with the parameters in the
order described). Note that the implementations of these two functions
and of `num_characters` __must__ loop character by character through the input
string. You __may not__, for example, use the `find` or
`count` string methods.
In addition, you must write a `main()` function that
does exactly what is described below. There are other requirements
as well, so please read and double-check your work carefully!
Incremental development and thinking through the order in which you
will implement and test things will be very important. Here is one
potential order:
- Implement and test `num_characters` (__you need to do this
before leaving lab__)
- Implement and test `count_char`
- Implement and test `num_words`
- Add code that asks the user for a letter and checks to
ensure that they only provide one letter (keep asking if they don't).
Test to make sure
that works and that it interacts correctly with the other three
functions.
- Add code that asks the user to enter lines repeatedly until
they enter a -1. Test that your code correctly counts the number
of lines, words, and characters.
- Add code for taking input from a file. Test.
#### count_char(string, char)
This function takes two parameters. Both have type `str`.
The second argument is additionally guaranteed to be only a single
character (guaranteeing that this is true is the responsibility of
whoever calls this function).
This function returns the number of times the
character `char` appears in `string`. The function
should ignore capitalization. For example:
```
count_char("Happy happy haPPY", "y")
count_char("HAPPY HAPPY HAPPY", "y")
count_char("HAPPY HAPPY HAPPY", "Y")
```
should all return 3 (type `int`).
Once you believe your implementation is correct, add test cases to the test
file `text_processing_tester.py` to thoroughly test your function.
#### num_words(string)
This function takes a single parameter (type `str`). It
returns the number of __words__ in `string` (return type is
`int`). In this case a word is defined as a consecutive
sequence of non-whitespace characters.
```
num_words("Happy haPpy day! !")
num_words(" Happy haPpy day! ")
```
should evaluate to 4 and to 3 respectively.
Once you believe your implementation is correct, add test cases to the test
file `text_processing_tester.py` to thoroughly test your function.
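One possible way to think about the definition above is that a word begins wherever a non-whitespace character is either the first character of the string or is preceded by whitespace. The helper below, with the hypothetical name `starts_word`, only checks that condition at a single position; it is a hint about the idea, not the required implementation.

```python
from string import whitespace


def starts_word(text, index):
    """Return True if text[index] begins a word: it is not whitespace and
    is either the first character or preceded by whitespace.
    (Hypothetical helper for illustration.)"""
    if text[index] in whitespace:
        return False
    return index == 0 or text[index - 1] in whitespace


sample = " Happy haPpy day! "
print(starts_word(sample, 1))   # the "H" in "Happy"
print(starts_word(sample, 2))   # the "a" in "Happy"
```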
#### main()
To put this all together, your program's `main` function should:
- Ask the user for a character `char` to count. If the
user does not enter a single letter, ask them again. Continue until they
enter a single letter.
- Ask the user if they want to run in file or interactive mode by
asking them to enter a 1 for file mode and a 0 for interactive mode.
- If the user is running in interactive mode: ask the
user to enter a line of text or a -1 if they are done. Continue
until the user enters a -1.
- If the user is running in file mode: ask the user for a filename.
- Print the following set of statistics either about the lines of
text entered by the user in interactive mode, or about the file if
in file mode:
* The total number of lines
* The total number of words
* The total number of non-whitespace characters
* The number of times `char` appears, ignoring case
* The average length of a word (number of non-whitespace
characters divided by the number of words)
* The percentage of `char` (number of times `char`
appears divided by the number of non-whitespace characters, times
100)
Note that the output should be in the format shown in the sample runs below.
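The last two statistics are simple ratios of the counts. The sketch below works them out for the counts in the interactive sample run (12 words, 38 non-whitespace characters, 3 e's). The function names are hypothetical, and returning 0 when a count is zero is an assumption, since the assignment does not say what to print in that case.

```python
def average_word_length(num_chars, word_count):
    """Average number of non-whitespace characters per word."""
    if word_count == 0:
        return 0.0  # assumption: report 0 when there are no words
    return num_chars / word_count


def percentage(char_count, num_chars):
    """Share of non-whitespace characters equal to the chosen letter."""
    if num_chars == 0:
        return 0.0  # assumption: report 0 when there are no characters
    return char_count / num_chars * 100


# counts taken from the interactive sample run
print(average_word_length(38, 12))
print(percentage(3, 38))
```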
*Note:* Since main functions don't have parameters (and return `None`) we
can't use our test file to thoroughly test our `main` function. But you
should still convince yourself that your function is correct.
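The "keep asking until the input is valid" step can be written as a `while` loop. The following is a sketch, not the required code: the function name `ask_for_letter` and its `read_line` parameter are hypothetical, added only so the loop can be tried with canned answers instead of typing. The prompts are copied from the sample run.

```python
def ask_for_letter(read_line=input):
    """Keep asking until the user enters exactly one letter, then return it.
    read_line is a hypothetical hook so the loop can be tried without typing."""
    print("single letter to count:")
    answer = read_line()
    # str.isalpha() is True only when every character is a letter
    while len(answer) != 1 or not answer.isalpha():
        print("you must enter a single letter!")
        print("single letter to count:")
        answer = read_line()
    return answer


# simulate the answers from the sample run instead of typing them
responses = iter(["asdf", "2345", "!", "e"])
letter = ask_for_letter(lambda: next(responses))
print(letter)
```

The same loop shape works for the interactive mode, where the stopping condition is the user entering -1.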
#### Going above and beyond
Some suggestions if you want to do more:
- The description above treats punctuation marks as letters (they
are included in words and character counts), when it might be better
to ignore them. Write new functions that ignore punctuation.
- Currently the average word length and percentage of a letter contain
much more precision than you really need. Use the information on Formatted
string literals at
https://docs.python.org/3/reference/lexical_analysis.html#f-strings
to format those numbers a little more nicely.
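As a small illustration of the formatting mentioned above, a format specifier such as `.2f` inside an f-string limits the number of digits after the decimal point. The values come from the interactive sample run; the variable names and the choice of two digits and one digit are arbitrary.

```python
average = 38 / 12
percent = 3 / 38 * 100

# ".2f" keeps two digits after the decimal point; ".1f" keeps one
average_text = f"average word length is: {average:.2f}"
percent_text = f"percentage e's is: {percent:.1f}"
print(average_text)
print(percent_text)
```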
Note that you should __not__ change the functionality of the named
functions above (since we'll be testing them to make sure they meet
the specifications as described).
Instead, you should add new functions with different names that do
different things and then describe these functions in the multiline
comment at the top of your file submission. I would also recommend
having a `main2()` function which, when executed, uses your new
and improved functions.
Incremental development and testing will be critical! Note the
suggested implementation order above.
#### Sample run
```
single letter to count:
asdf
you must enter a single letter!
single letter to count:
2345
you must enter a single letter!
single letter to count:
!
you must enter a single letter!
single letter to count:
e
enter 1 for file or 0 for interactive
0
input line or -1 to stop:
A is for apple
input line or -1 to stop:
B is for banana
input line or -1 to stop:
C is for cantelope
input line or -1 to stop:
-1
******** statistics ********
3 lines
12 words
38 non-whitespace characters
3 e's
average word length is: 3.1666666666666665
percentage e's is: 7.894736842105263
```
And an example of running in file mode:
```
single letter to count:
p
enter 1 for file or 0 for interactive
1
filename?:
input1.txt
******** statistics ********
7 lines
66 words
386 non-whitespace characters
12 p's
average word length is: 5.848484848484849
percentage p's is: 3.1088082901554404
```
#### Coding Style
Make sure that your program is properly commented:
* You should have comments at the very beginning of the file stating your name, course,
assignment number and the date.
* Each function should have an appropriate docstring, describing:
- the purpose of the function
- the types and meanings of each parameter
- the type and meaning of the return value(s)
* Include other comments as necessary to make your code clear
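For illustration, a docstring covering the three points above might look like the following. The function `repeat_text` is a made-up example unrelated to the assignment, and the exact layout is an assumption; the Python Coding Style Guidelines for the course remain the authority on formatting.

```python
def repeat_text(text, times):
    """
    Build a string made of several copies of text.

    text (str): the string to repeat
    times (int): how many copies to join together
    returns (str): the repeated string
    """
    # string repetition handles times == 0 by returning ""
    return text * times


print(repeat_text("ab", 3))
```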
In addition, make sure that you have used good style. This includes:
* Following naming conventions, e.g. all variables and functions should be lowercase.
* Using good (mnemonic) variable names.
* Proper use of whitespace, including indenting and use of blank lines to separate chunks
of code that belong together.
For more detailed descriptions, please review the [Python Coding Style Guidelines](../../python_style.html).
## Part 3: Feedback
Create a file named `feedback.txt` that answers the usual questions:
1. How long did you spend on this assignment? Please include time spent during lab, including time spent on Part 1.
2. Any comments or feedback? Things you found interesting? Things you found challenging? Things you found boring?
## Submission
For this lab you are required to submit three files:
- `text_processing.py` a python file that contains the implementation of all the
required functions as specified.
- `text_processing_tester.py` a python file that contains test cases for
your functions.
- `feedback.txt` a text file containing your feedback for this assignment.
These should be submitted using [submit.cs.pomona.edu](http://submit.cs.pomona.edu)
as described in the general [submission instructions](../../submit.html).
Note that we reserve the right to give you no more than half credit if your files are
named incorrectly and/or your function headers do not match the specifications (including
names, parameter order, etc). Please double and triple check this before submitting!
## Grade Point Allocations
| Part | Feature | Value |
|-----------|-------------------------------------------|-----|
| Lab | Check-in | 3 |
| | | |
| Execution | correct single-character count from `count_char` | 4 |
| Execution | correct number of words from `num_words` | 8 |
| Execution | correct number of lines | 2 |
| Execution | statistics correctly calculated/printed | 6 |
| Execution | correctly asks for single char to count | 4 |
| Execution | works interactively | 4 |
| Execution | works with files | 4 |
| Testing | thoroughly tests `count_char` | 4 |
| Testing | thoroughly tests `num_words` | 4 |
| | | |
| Style | Using `for` loops with no forbidden string methods | 6 |
| Style | Files submitted correctly | 1 |
| Style | Docstrings in functions | 3 |
| Style | Comments in code relevant and appropriate | 2 |
| Style | Good use of variable names | 2 |
| Style | Good use of whitespace | 2 |
| Style | Good use of loops and conditionals | 2 |
| Style | Misc | 2 |
| | | |
| Feedback | Completed feedback file submitted | 2 |