# Text Processing

The goal of this assignment is to familiarize you with processing strings, sequences, and files through exercises such as looping over the contents of a string and reading and writing files.

| Part          | Section                                       |
|---------------|-----------------------------------------------|
| 1 (in-lab)    | [Part 1: simple text processing](#part1)      |
| 1 (in-lab)    | [Check-in](#checkin)                          |
| 2 (lab/home)  | [Part 2: more text processing](#part2)        |
| 2 (lab/home)  | [Submission Instructions](#submission)        |

## Getting Started

First, decide whether or not you plan to work with a partner on this assignment. Working with a partner is strongly encouraged. However, regardless of whether you work alone or in a pair, you are required to read about pair programming. If you have not already done so, read the [instructions on pair programming](../../pair_programming.html) before you decide.

If you decide to work with a partner, make sure you are sitting next to each other. You will only need one computer. (But you will probably want two chairs!)

One person should create a new project named `TextProcessing` in the `CSCI051p-Workspace` you created on your Desktop. *Double check that you are creating the project in the right place, or you will likely have trouble finding your files later.*

Then download the [starter code](tp.zip). You should see a folder named `starter` that contains two files (`text_processing.py` and `text_processing_tester.py`) and one subfolder (`examples`). Copy the two Python files and the subfolder into the (recently created) `CSCI051p-Workspace/TextProcessing` folder.

If you don't see all the new files, ask PyCharm to rescan that folder by clicking the triangle next to that folder (on the left-side list) to close and re-open it. The newly added items (`text_processing.py`, `text_processing_tester.py` and `examples`) should now be visible.

<a name="part1"></a>
## Part 1: simple text processing

In this part, you will write some functions that process strings and text files.

#### 1. every_fourth_char

Define a function `every_fourth_char` that takes one parameter `string` of type `str` and returns a new string constructed from every fourth character, starting with the first. For example, `every_fourth_char("abcdefghijklmnopqrstuvwxyz")` should evaluate to `"aeimquy"`.

Note: it is not required that you formally test this function, but you should convince yourself that it is correct.

#### 2. copy_parts_of_file

Define a function `copy_parts_of_file` that takes two parameters (`old_filename` and `new_filename`, both of type `str`) and creates a new file named `new_filename`. Each line of the new file consists of every fourth character of the corresponding line in the file `old_filename`. For example, if the original file contains the following Shel Silverstein poem:

```
Oh, if you're a bird, be an early bird
And catch the worm for your breakfast plate.
If you're a bird, be an early bird-
But if you're a worm, sleep late.
```

The new file should have the contents:

```
Oioeb, eyr
Acheroobkta
Ioeb, eyr
Bioew,el.
```

Your implementation should use `every_fourth_char` as a helper function.

Open the files `"examples/earlybird.txt"` and `"examples/fourth.txt"` and look at the contents. Then try running your function with `old_filename` set to each of these filenames. Remember to set `new_filename` to a new value to avoid overwriting other files.

Note that it is not required that you formally test this function, but you should convince yourself that it is correct.

You should not consider the newline at the end of a line to be a character. You can remove it with the string `rstrip` method: rather than passing each `line` to `every_fourth_char`, you might prefer to pass `line.rstrip("\n")`.

#### 3. num_characters(string)

This function takes a single parameter `string` (type `str`). It returns the number of __non-whitespace__ characters in `string`. For example:

```
num_characters("Happy haPpy day! !")
```

should evaluate to 15 (type `int`).

*Hint:* Remember that the string module (<a href="https://docs.python.org/3/library/string.html"> https://docs.python.org/3/library/string.html</a>) in Python provides pre-defined strings with names such as `whitespace`, `punctuation`, and `ascii_lowercase`. This means, for example, that the following code will print `False`.

```
from string import *
print("K" in ascii_lowercase)
```

Once you believe your function is correct, you must define test cases in the file `text_processing_tester.py` to thoroughly test this function.

#### 4. count_characters(filename)

This function takes a single parameter `filename` (type `str`). It returns the number of __non-whitespace__ characters in that file.

Try running this function with `examples/alice.txt` to count the number of non-whitespace characters in Lewis Carroll's Alice in Wonderland.

It is not necessary to formally test this function, but you should convince yourself that it is correct. My program found 121,981 non-whitespace characters in `examples/alice.txt`.

#### 5. Looking forward to part 2

For Part 2 you will do something a little bit trickier than `num_characters`. Rather than being asked to count all of the non-whitespace characters in a string, you will be asked to count the instances of a particular character, and to do so regardless of whether it appears in upper or lower case.

Consider the following poem by E. E. Cummings (<a href="https://en.wikipedia.org/wiki/L(a"> https://en.wikipedia.org/wiki/L(a</a>):

```
l(a
le
af
fa
ll
s)
one
l
iness
```

If the user asks to count the character "a", how many instances do you see? How would you modify the `num_characters` algorithm to do this?
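One building block for the question above is comparing two single characters without regard to case. The following is only a minimal sketch of that idea, not the required solution; the helper name `same_letter` is hypothetical and does not appear in the starter code.

```python
def same_letter(first, second):
    """Return True if two single-character strings match, ignoring case.

    Hypothetical helper for illustration only (not part of the assignment).
    """
    # Lowercasing both sides makes "A" and "a" compare as equal.
    return first.lower() == second.lower()


print(same_letter("A", "a"))  # True
print(same_letter("a", "b"))  # False
```

A comparison like this can be applied to each character visited by a loop, which keeps the character-by-character structure you already used for `num_characters`.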
<a name="checkin"></a>
#### Checking In

Before finding a TA or professor, make sure your four functions work correctly and are written using good coding style. In particular, make sure your functions have:

- appropriate docstrings
- good algorithm comments
- mnemonic variable names
- good use of horizontal and vertical white space

We will double check your code, ask you a few questions about it, and answer any questions you have. We will then ask you for the answers to the Part 2 example and answer any questions you might have about Part 2. We will then award your points for Part 1. This must be completed before leaving the lab. After that you should start working on Part 2.

<a name="part2"></a>
## Part 2: more text processing

For this assignment you will implement something similar to `wc`, a Unix command that "displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output."

In particular, you will implement 2 additional functions, `count_char` and `num_words`, with the names and the parameters described below (with the parameters in the order described). Note that the implementations of these two functions and of `num_characters` __must__ loop character by character through the input string. You __may not__, for example, use the `find` or `count` string methods.

In addition, you must write a `main()` function that does exactly what is described below. There are other requirements as well, so please read and double-check your work carefully!

Incremental development and thinking through the order in which you will implement and test things will be very important. Here is one potential order:

- Implement and test `num_characters` (__you need to do this before leaving lab__)
- Implement and test `count_char`
- Implement and test `num_words`
- Add code that asks the user for a letter and checks to ensure that they only provide one letter (keep asking if they don't). Test to make sure that works and that it interacts correctly with the other three functions.
- Add code that asks the user to enter lines repeatedly until they enter a -1. Test that your code correctly counts the number of lines, words, and characters.
- Add code for taking input from a file. Test.

#### count_char(string, char)

This function takes two parameters. Both have type `str`. The second argument is additionally guaranteed to be only a single character (guaranteeing that this is true is the responsibility of whoever calls this function). This function returns the number of times the character `char` appears in `string`. The function should ignore capitalization. For example:

```
count_char("Happy happy haPPY", "y")
count_char("HAPPY HAPPY HAPPY", "y")
count_char("HAPPY HAPPY HAPPY", "Y")
```

should all return 3 (type `int`).

Once you believe your implementation is correct, add test cases to the test file `text_processing_tester.py` to thoroughly test your function.

#### num_words(string)

This function takes a single parameter (type `str`). It returns the number of __words__ in `string` (return type is `int`). In this case a word is defined as a consecutive sequence of non-whitespace characters. For example:

```
num_words("Happy haPpy day! !")
num_words(" Happy haPpy day! ")
```

should evaluate to 4 and to 3 respectively.

Once you believe your implementation is correct, add test cases to the test file `text_processing_tester.py` to thoroughly test your function.

#### main()

To put this all together, your program's `main` function should:

- Ask the user for a character `char` to count. If the user does not enter a single letter, ask them again. Continue until they enter a single letter.
- Ask the user if they want to run in file or interactive mode by asking them to enter a 1 for file mode and a 0 for interactive mode.
- If the user is running in interactive mode: ask the user to enter a line of text or a -1 if they are done. Continue until the user enters a -1.
- If the user is running in file mode: ask the user for a filename.
- Print the following set of statistics either about the lines of text entered by the user in interactive mode, or about the file if in file mode:
    * The total number of lines
    * The total number of words
    * The total number of non-whitespace characters
    * The number of times `char` appears, ignoring case
    * The average length of a word (number of non-whitespace characters divided by the number of words)
    * The percentage of `char` (number of times `char` appears divided by the number of non-whitespace characters, times 100)

Note that the output should be in the format shown in the sample runs below.

*Note:* Since main functions don't have parameters (and return `None`), we can't use our test file to thoroughly test our `main` function. But you should still convince yourself that your function is correct.

#### Going above and beyond

Some suggestions if you want to do more:

- The description above treats punctuation marks as letters (e.g. they are included in words and character counts), when it might be better to ignore them. Write new functions that ignore punctuation.
- Currently the average word length and percentage of a letter contain much more precision than you really need. Use the information on formatted string literals at <a href="https://docs.python.org/3/reference/lexical_analysis.html#f-strings"> https://docs.python.org/3/reference/lexical_analysis.html#f-strings</a> to format those numbers a little more nicely.

Note that you should __not__ change the functionality of the named functions above (since we'll be testing them to make sure they meet the specifications as described). Instead, you should add new functions with different names that do different things, and then describe these functions in the multiline comment at the top of your submitted file. I would also recommend having a `main2()` function which, when executed, uses your new and improved functions.
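The formatting suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions, not a required part of the assignment: the helper name `format_stat` is hypothetical, and the choice of two decimal places is an assumption rather than something the assignment specifies.

```python
def format_stat(label, value, places=2):
    """Return 'label: value' with value rounded to a fixed number of decimals.

    Hypothetical helper for illustration; the default of 2 places is an assumption.
    """
    # A nested replacement field lets the precision come from a variable.
    return f"{label}: {value:.{places}f}"


# Values taken from the interactive sample run: 38 characters, 12 words, 3 e's.
print(format_stat("average word length is", 38 / 12))    # average word length is: 3.17
print(format_stat("percentage e's is", 3 / 38 * 100))    # percentage e's is: 7.89
```

Because this changes the printed output, it belongs in a separate function such as `main2()` rather than in the required `main()`.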
Incremental development and testing will be critical! Note the suggested implementation order above.

#### Sample run

```
single letter to count: asdf
you must enter a single letter!
single letter to count: 2345
you must enter a single letter!
single letter to count: !
you must enter a single letter!
single letter to count: e
enter 1 for file or 0 for interactive 0
input line or -1 to stop: A is for apple
input line or -1 to stop: B is for banana
input line or -1 to stop: C is for cantelope
input line or -1 to stop: -1
******** statistics ********
3 lines
12 words
38 non-whitespace characters
3 e's
average word length is: 3.1666666666666665
percentage e's is: 7.894736842105263
```

And an example of running in file mode:

```
single letter to count: p
enter 1 for file or 0 for interactive 1
filename?: input1.txt
******** statistics ********
7 lines
66 words
386 non-whitespace characters
12 p's
average word length is: 5.848484848484849
percentage p's is: 3.1088082901554404
```

#### Coding Style

Make sure that your program is properly commented:

* You should have comments at the very beginning of the file stating your name, course, assignment number and the date.
* Each function should have an appropriate docstring, describing:
    - the purpose of the function
    - the types and meanings of each parameter
    - the type and meaning of the return value(s)
* Include other comments as necessary to make your code clear.

In addition, make sure that you have used good style. This includes:

* Following naming conventions, e.g. all variables and functions should be lowercase.
* Using good (mnemonic) variable names.
* Proper use of whitespace, including indenting and use of blank lines to separate chunks of code that belong together.

For more detailed descriptions, please review the [Python Coding Style Guidelines](../../python_style.html).

## Part 3: Feedback

Create a file named `feedback.txt` that answers the usual questions:

1. How long did you spend on this assignment? Please include time spent during lab, including time spent on Part 1.
2. Any comments or feedback? Things you found interesting? Things you found challenging? Things you found boring?

<a name="submission"></a>
## Submission

For this lab you are required to submit three files:

- `text_processing.py`, a Python file that contains the implementation of all the required functions as specified.
- `text_processing_tester.py`, a Python file that contains test cases for your functions.
- `feedback.txt`, a text file containing your feedback for this assignment.

These should be submitted using [submit.cs.pomona.edu](http://submit.cs.pomona.edu) as described in the general [submission instructions](../../submit.html).

Note that we reserve the right to give you no more than half credit if your files are named incorrectly and/or your function headers do not match the specifications (including names, parameter order, etc). Please double and triple check this before submitting!

## Grade Point Allocations

| Part      | Feature                                            | Value |
|-----------|----------------------------------------------------|-------|
| Lab       | Check-in                                           | 3     |
|           |                                                    |       |
| Execution | correct single-character count from `count_char`   | 4     |
| Execution | correct number of words from `num_words`           | 8     |
| Execution | correct number of lines                            | 2     |
| Execution | statistics correctly calculated/printed            | 6     |
| Execution | correctly asks for single char to count            | 4     |
| Execution | works interactively                                | 4     |
| Execution | works with files                                   | 4     |
| Testing   | thoroughly tests `count_char`                      | 4     |
| Testing   | thoroughly tests `num_words`                       | 4     |
|           |                                                    |       |
| Style     | Using `for` loops with no forbidden string methods | 6     |
| Style     | Files submitted correctly                          | 1     |
| Style     | Docstrings in functions                            | 3     |
| Style     | Comments in code relevant and appropriate          | 2     |
| Style     | Good use of variable names                         | 2     |
| Style     | Good use of whitespace                             | 2     |
| Style     | Good use of loops and conditionals                 | 2     |
| Style     | Misc                                               | 2     |
|           |                                                    |       |
| Feedback  | Completed feedback file submitted                  | 2     |