# Data Analysis

The goal of this assignment is to familiarize you with data analysis and visualization. You'll practice handling files in comma-separated-value (csv) format, create and manipulate Python dictionaries, and do some basic plotting using the `matplotlib` package.

| Part         | Section                                           |
|--------------|---------------------------------------------------|
| 1 (in-lab)   | [Part 1: Data Processing](#part1)                 |
| 1 (in-lab)   | [Check-in Instructions](#checkin)                 |
| 2 (lab/home) | [Part 2: Data analysis and visualization](#part2) |
| 2 (lab/home) | [Submission Instructions](#submission)            |

## Getting Started

Create a new project named `Stocks` in the `CSCI051p-Workspace` you created on your Desktop. *Double check that you are creating the project in the right place, or you will likely have trouble finding your files later.*

Then download the [starter code](stocks.zip). You should see a folder named `starter` that contains two Python files, `stock_rain.py` and `stock_rain_tester.py`, and one subfolder, `csvs`, that includes a few csv files. Copy the two Python files and the subfolder into the (recently created) `CSCI051p-Workspace/Stocks` folder.

If you don't see all the new files, ask PyCharm to rescan that folder by clicking the triangle next to that folder (on the left-side list) to close and re-open it. The newly added items (`stock_rain.py`, `stock_rain_tester.py` and `csvs`) should now be visible.

**Note:** If you are working on your own computer, you might need to manually install the matplotlib package. You can do so following the same method as you used to [install the Arcade package](../A3/install.html).

<a name="part1"></a>

## Part 1: Data Processing

There are several published papers that study the hypothesis that weather affects stock prices. One example is "Stock Prices and Wall Street Weather" [[1]](#ref). A more recent example is "Weather-Induced Mood, Institutional Investors, and Stock Returns" [[2]](#ref).
For this assignment, you will use publicly available stock data and weather data to evaluate whether or not this hypothesis is true.

<!-- data from www.kaggle.com -->

As a first step, you will need to process the data from the files in your dataset to create Python dictionaries that make it easy to work with the useful parts. To do this you'll write two functions: one for the file with stock price information and one for the file with rainfall information.

#### Function 1: parse_rainfall(fname)

##### Weather data format

Part of the provided dataset consists of weather data for Seattle. The complete weather dataset is named `rainSeattle-1948-2017.csv`. There's also a subset of that data in a file named `rainSeattle-2012-2017.csv`. Finally, the file `rainTest.csv` contains exactly what's shown in the example below.

The files contain a single header line with the names of the columns:

```
"DATE","PRCP","TMAX","TMIN","RAIN"
```

This is followed by lines of data in the following format:

```
"2012-01-01",0,55,41,"FALSE"
"2012-01-02",0.43,51,37,"TRUE"
"2012-01-03",0.03,53,45,"TRUE"
"2012-01-04",0.8,54,42,"TRUE"
"2012-01-05",0.05,48,37,"TRUE"
"2012-01-06",0.1,40,36,"TRUE"
"2012-01-07",0,45,37,"FALSE"
"2012-01-08",0,50,37,"FALSE"
"2012-01-09",0.17,49,41,"TRUE"
```

The first column is the date. The second is the amount of precipitation in inches, the third and fourth are the high and low temperatures of that day respectively, and the last column is TRUE if and only if it rained that day.

If there's no data for a particular day, that entry will contain "NA" (including the quotes!). For example, the following line is in the weather data:

```
"1998-06-02","NA",72,52,"NA"
```

##### Processing the weather data

Write a function `parse_rainfall` that takes one parameter. The parameter `fname` is a string that is the name of the file. The function `parse_rainfall` returns a dictionary. The keys of the dictionary are strings that represent dates, such as "2012-01-01".
The values in the dictionary are floats that represent the total precipitation recorded for that date in Seattle. It will look something like this:

```
{ '2012-01-01': 0,
  '2012-01-02': 0.43,
  '2012-01-03': 0.03,
  '2012-01-04': 0.8,
  '2012-01-05': 0.05,
  '2012-01-06': 0.1,
  '2012-01-07': 0,
  '2012-01-08': 0,
  '2012-01-09': 0.17 }
```

If no precipitation data is available for a particular date, don't create an entry.

#### Function 2: parse_stock(fname, sym)

##### Stock data format

We've included three sample files with the starter code. The complete dataset is named `stocks-2006-2017.csv` and contains approximately 100,000 lines of text. The file `GOOGL-2012-2017.csv` contains just the stock information for Google from January 2012 through December 2017. The file `stockTest.csv` is the first few lines of the Google data and contains exactly what's shown in the example below.

The files contain a single header line with the names of the columns:

```
Date,Open,High,Low,Close,Volume,Name
```

This is followed by lines of data in the following format:

```
1/3/12,326.8,334.41,326.51,333.04,3677850,GOOGL
1/4/12,332.85,335.46,330.64,334.47,2864365,GOOGL
1/5/12,331.4,332.32,328.44,329.83,3283277,GOOGL
1/6/12,329.9,330.33,325.22,325.34,1438122,GOOGL
1/9/12,323.57,323.82,310.93,311.54,5825720,GOOGL
```

The first column is the date and the last column is the stock symbol for the company that this line of data is for. The second column is the opening price of the stock, the third is the highest price it reached during that day, the fourth is the lowest price it reached during that day, the fifth is the closing price, and the sixth is the volume traded.

If there's no data for some reason, that entry will be blank. For example, the following line is in the full stock file:

```
7/31/17,,201.66,,201.17,1833625,MMM
```

##### Processing the stock data

Write a function `parse_stock` that takes two parameters.
The first, `fname`, is a string that is the name of a file with stock information in the format specified above. The second, `sym`, is a string that is a stock symbol. The function `parse_stock` returns a dictionary. The keys of the dictionary are strings that represent dates, such as "2018-03-31". The values in the dictionary are floats that represent the day's _change in price_, defined as Close minus Open. It will look like this:

```
{ '2012-01-03': 6.24,
  '2012-01-04': 1.62,
  '2012-01-05': -1.57,
  '2012-01-06': -4.56,
  '2012-01-09': -12.03 }
```

Each row from the input file is stored in the dictionary if and only if its symbol matches the value given as `sym`. It follows that if `sym` is a symbol not present in the input data, the return value is an empty dictionary.

If the Open or Close value is missing for a particular date, don't create an entry in the dictionary.

Helpful hint: You may want to write a helper function that translates dates from the format in the stock input files to the format expected for the keys in the dictionary you're creating.

<a name="checkin"></a>

#### Checking In

When you complete Part 1, check in with a TA to collect your lab points! Before finding a TA or professor, make sure your functions have:

- appropriate docstrings
- good algorithm comments
- mnemonic variable names
- good use of horizontal and vertical white space

We will double check your code, ask you a few questions about it, answer any questions you have, and award your points for Part 1. This must be completed before leaving the lab. After that you should start working on Part 2.

<a name="part2"></a>

## Part 2: Data analysis and visualization

To perform data analysis and visualization, you will write three functions. First, you will write a function that combines two dictionaries (created by processing the input files using your functions from Part 1) into a single list containing precisely the data you want to plot.
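As a rough illustration of what "combining two dictionaries" means in general, here is a minimal sketch that pairs up values from two dictionaries wherever they share a key. The dictionaries `visitors` and `high_temps` and their contents are invented for this example and have nothing to do with the assignment's data; your own function will need to follow the specification given for it.

```python
# Toy data, invented for illustration only (not from the assignment's csv files).
visitors = {"2020-05-01": 120, "2020-05-02": 95, "2020-05-04": 143}
high_temps = {"2020-05-02": 68.0, "2020-05-03": 71.5, "2020-05-04": 75.0}

pairs = []
for date in visitors:
    # Keep a date only if the second dictionary also has an entry for it.
    if date in high_temps:
        pairs.append([visitors[date], high_temps[date]])

print(pairs)  # [[95, 68.0], [143, 75.0]]
```

Notice that the dates themselves are used only for matching; they do not appear in the resulting list.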
#### Function 3: correlate_data(stock_dict, rain_dict)

The `correlate_data` function takes 2 parameters. The first, `stock_dict`, is a dictionary mapping date strings to floats representing daily changes in stock prices. The second, `rain_dict`, is a dictionary mapping date strings to floats representing daily rainfall totals. Coincidentally (or not), these are exactly the types of dictionaries created by the two functions described in Part 1.

The `correlate_data` function returns a list. Each item in the list is a list of size 2, which represents a `[stock_price_change, rainfall]` pair. Note that although we used dates as dictionary keys, and although we will need to use the dates to find corresponding stock/rain pairs, the list that is returned does not contain any dates.

Helpful hint: Take care that your output list contains only pairs that were created from dates that exist in both input dictionaries.

If your `stock_dict` and `rain_dict` dictionaries appeared as in the examples above, then the output of `correlate_data` should be the following list of lists:

```
[ [ 6.24, 0.03 ],
  [ 1.62, 0.8 ],
  [ -1.57, 0.05 ],
  [ -4.56, 0.1 ],
  [ -12.03, 0.17 ] ]
```

The next task is to create a scatter plot that compares rainfall and price deltas. For this, you will write one more function:

#### Function 4: scatter_plot(data, format, name, done)

The `scatter_plot` function takes 4 parameters. The first, `data`, is a list where each entry is a list of size 2. The second, `format`, is a matplotlib format string. The third, `name`, is a string that represents the stock whose data is being passed in. And the fourth, `done`, is a boolean that is `True` if and only if this is the last plot.

When `scatter_plot` runs successfully, it plots the data in `data`. In particular, the function should plot the hypothesized cause (rain) on the x-axis and the hypothesized effect (stock price change) on the y-axis. However, it will only display the plot (and the legend) if `done` is `True`.
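If matplotlib format strings are new to you, the following sketch shows how one is used with `plt.plot`. The numbers and the labels `"first"` and `"second"` are made up for illustration; this is not the required `scatter_plot` function, only a demonstration of the library calls involved.

```python
import matplotlib.pyplot as plt

# Toy [y, x] pairs, invented for illustration (not taken from the csv files).
first = [[1.5, 0.0], [-0.4, 0.2], [0.9, 0.7]]
second = [[0.3, 0.0], [0.8, 0.2], [-1.1, 0.7]]

drawn = []  # keep the line objects that plt.plot returns
# "ro" = red circles, "b^" = blue triangles. A marker with no line style
# draws unconnected points, which gives a scatter plot.
for pairs, fmt, label in [(first, "ro", "first"), (second, "b^", "second")]:
    xs = [pair[1] for pair in pairs]
    ys = [pair[0] for pair in pairs]
    drawn.extend(plt.plot(xs, ys, fmt, label=label))

# Call these once, after every series has been added.
plt.legend()
plt.show()
```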
Hint: The purpose of the `done` parameter is to determine when to call `plt.legend()` and `plt.show()`, since nothing more can be added to the plot once it has been displayed. Also note that the format of the `data` parameter is the same as the return type of `correlate_data`.

#### Function 5: main()

Finally, you must include a `main()` function. First, the `main()` function will ask the user to input a rainfall data file, a stock data file, and two stock symbols. Then, the `main()` function calls `parse_stock` and `parse_rainfall` before also calling `correlate_data` and `scatter_plot`.

Since the motivating question for this assignment involved technology companies headquartered in Seattle vs. elsewhere, there must be exactly two stocks plotted by your main function. The first should be either Microsoft (MSFT) or Amazon (AMZN), since they are both primarily located in Seattle. The other should be another technology company that is primarily located elsewhere.

#### Sample session

```
Enter the name of a rainfall data file: csvs/rainSeattle-2012-2017.csv
Enter the name of a stock data file: csvs/stocks-2006-2017.csv
Enter a first stock symbol (e.g. MSFT or AMZN): MSFT
Enter a second stock symbol (not head-quartered in Seattle): IBM
```

After this, the program displays a plot like the following:

<img src="figures/MSFT_IBM.png" alt="Rainfall vs Price Change for MSFT and IBM">

#### Details, Hints, and Suggestions

- Incremental development and testing will be helpful. The strategy from lab is to start by writing the function headers and docstrings for each function and then writing and testing them one at a time.
- Individual functions can (should) also be developed incrementally. For example, if you're working on the `parse_rainfall` function, make sure your code correctly opens the file and that it correctly identifies the two pieces of information on each line that are of interest. Then make sure it correctly creates dictionary entries for the case where there's no missing data.
Finally, make sure it works for the case where some data is missing.
- You can open a CSV file in Microsoft Excel, Google Sheets, or any text editor to examine its contents.
- Please make sure the following block is included in your final submission.

  ```
  if __name__ == "__main__":
      main()
  ```

#### Testing

We suggest you first use `stockTest.csv` and `rainTest.csv` to test the basic functionality of the functions you developed. More importantly, you need to test your program on the larger csv files in the starter, since these files contain the full range of data formats and corner cases that your program must handle.

We will be testing your code using an autograder, which means you really, really, really need your function names, parameters, and outputs to match the specifications given above. You can use the `stock_rain_tester.py` file in the starter as a simple check (that said, our actual tests for grading will be different and more extensive, so please do additional testing of your code as well!).

#### Coding Style

Make sure that your program is properly commented:

* You should have comments at the very beginning of the file stating your name, course, assignment number and the date.
* Each function should have an appropriate docstring, describing:
  - the purpose of the function
  - the types and meanings of each parameter
  - the type and meaning of the return value(s)
* Include other comments as necessary to make your code clear.

In addition, make sure that you have used good style. This includes:

* Following naming conventions, e.g. all variables and functions should be lowercase.
* Using good (mnemonic) variable names.
* Proper use of whitespace, including indenting and use of blank lines to separate chunks of code that belong together.

For more detailed descriptions, please review the [Python Coding Style Guidelines](../../python_style.html).

## Part 3: Feedback

Create a file named `feedback.txt` that answers the usual questions:

1. How long did you spend on this assignment?
2. Any comments or feedback? Things you found interesting? Things you found challenging? Things you found boring?

<a name="submission"></a>

## Submission

For this lab you are required to submit two files:

- `stock_rain.py`, a Python file that contains all of the implementation required above.
- `feedback.txt`, a text file containing your feedback for this assignment.

These should be submitted using [submit.cs.pomona.edu](http://submit.cs.pomona.edu) as described in the general [submission instructions](../../submit.html).

Note that we reserve the right to give you no more than half credit if your files are named incorrectly and/or your function headers do not match the specifications (including names, parameter order, etc). Please double and triple check this before submitting!

## Grade Point Allocations

| Part      | Feature                                                | Value |
|-----------|--------------------------------------------------------|-------|
| Lab       | Check-in                                               | 3     |
|           |                                                        |       |
| Execution | `parse_stock`                                          | 4     |
| Execution | `parse_rainfall`                                       | 4     |
| Execution | `correlate_data`                                       | 10    |
| Execution | `scatter_plot`                                         | 10    |
| Execution | `main`                                                 | 2     |
|           |                                                        |       |
| Style     | Docstrings accurate, relevant                          | 2     |
| Style     | Comments in code accurate, relevant                    | 2     |
| Style     | Good use of conditionals and loops                     | 2     |
| Style     | Good coding style, e.g., variable names and whitespace | 2     |
| Style     | Misc                                                   | 2     |
|           |                                                        |       |
| Feedback  | Completed feedback file submitted                      | 2     |

<a name="ref"></a>

[1] Stock Prices and Wall Street Weather, by Edward M. Saunders, Jr. The American Economic Review, v83 n4 (Dec 1993) pp. 1337-1345

[2] Weather-Induced Mood, Institutional Investors, and Stock Returns, by William N. Goetzmann, Dasol Kim, Alok Kumar, and Qin Wang. The Review of Financial Studies, v28 n1 (Jan 2015) pp. 73-111

<!--[3] Thanks to Mikey Dickerson for helpful conversations! -->