# Final Project

The goal of the final project is to demonstrate what you have learned in this course and to explore how your new skills can be applied to solve real-world problems in the domain of your choice.

The final project may be completed either individually or in pairs. If you are working as a pair, you must follow the rules of pair programming.

To complete this project, you will need to:

1. Propose a project. This involves selecting a dataset and formulating three or more questions that you could answer using the data in your chosen dataset. Your project proposal will be due on Sunday, November 24 at 5pm. You will then meet with your lab instructor during lab on November 25 or November 26 to discuss your proposal.
2. Implement code that analyzes your data in a way that provides an answer to your questions. Your code should produce both numeric data and visualizations.
3. Write a brief (approximately one page) report summarizing your questions, your data, and your results.
4. Submit a feedback file for the final project.

## Part 1: Project Proposal (Due: November 24, 2019)

The first step of this project is to find a dataset and to formulate three or more possible questions that you could evaluate with this data. It is fine to use multiple datasets if necessary to evaluate your questions.

We encourage you to propose a project in a domain of interest to you. [Kaggle](https://www.kaggle.com/datasets) is an excellent source of datasets. You will also find that many professors have datasets related to their research interests that they might be willing to share with you. If you are interested in a particular topic and can't find a suitable dataset, please reach out to the course staff during office hours or on Piazza, and we will try to help you find one.

We recommend that you use a dataset that is in CSV format and that is not too big (we'd recommend not going above about 100MB).

Once you have found a dataset or datasets you are interested in, formulate three (or more) possible questions that you could answer with this data. Make sure that you will be able to answer your questions with the data you have! Then write up a brief project proposal (a few sentences is fine) describing (1) what dataset(s) you will be using, including what columns are included in the dataset(s) and a link to the dataset(s), and (2) what your questions are.

Your project proposal should be submitted on cs.submit by **5pm on November 24, 2019.**

#### Example Projects

**Airbnb.** Airbnb is a company that runs an online marketplace that lets people rent out their house, apartment, spare room, etc. A dataset on Airbnb rentals is available [here](http://tomslee.net/airbnb-data-collection-get-the-data). Each row of this dataset includes a variety of information, including the room id, host id, city, room type, number of reviews, average satisfaction, number of people accommodated, number of bedrooms, price, and last date updated. Possible questions include:

* Is there any correlation between the price of a listing and the overall satisfaction?
* Do hosts typically have multiple listings at the same time?
* How do the prices of a rental change over time?

**Kickstarter.** Kickstarter is a crowdsourcing platform for funding creative projects. A dataset on Kickstarter projects is available [here](https://www.kaggle.com/kemical/kickstarter-projects). Each row of this dataset includes a variety of information, including the category, the fundraising goal, the date launched, and the total amount pledged. Possible questions include:

* Is there a correlation between the fundraising goal and the total amount pledged?
* How has the rate of successful projects changed over time?
* Which categories of project receive the most funding?


**Bank Loans.** Banks maintain detailed records about loans in order to track current loans and to decide rates for future loans. A dataset on loans is available [here](https://www.kaggle.com/mrferozi/loan-data-for-dummy-bank). Each row of this dataset includes issue date, annual income, loan amount, loan term, and interest rate. Possible questions include:

* How have interest rates changed over time?
* Is there a correlation between annual income and interest rate for loans with the same term?
* Which states have the highest ratio of bad to good loans?

**Forest Fires.** Forest fires are a common disaster in some parts of the world, including California. Forest fires can be caused by dry climate, lightning strikes, and human activities. A dataset on forest fires is available [here](https://www.kaggle.com/elikplim/forest-fires-data-set). Each row of this dataset includes a variety of information, including the month when the fire occurred, the weather conditions at the time (temperature, humidity, and wind speed), and the total area burned. Possible questions include:

* At what time of year are forest fires most common?
* Is there any correlation between weather conditions and the total area burned?
* Is there a difference in the area burned depending on whether or not it is raining?

**Movies.** Movies are both fun to watch and a $100 billion business, so both fans and studios collect data about movies. A dataset on movies from IMDb is available [here](https://www.kaggle.com/PromptCloudHQ/imdb-data). This dataset includes movie rank, genre, director, lead actors, year, runtime, user rating, critic rating, and total revenue. Possible questions include:

* Is there a correlation between critic rating and total revenue? Or between user rating and total revenue?
* How has running length changed over time?
* Who are the most bankable directors and movie stars?

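
The proposal asks you to list the columns in your dataset and to confirm that the data can answer your questions. The following is a minimal sketch of one way to check this for a CSV file using only the standard library. The function names `list_columns` and `count_rows` are hypothetical, and the sketch assumes that the first row of the file is a header row.

```python
import csv


def list_columns(filename):
    """Return the column names from the header row of a CSV file.

    Assumes the first row of the file is a header; returns an empty
    list if the file has no rows at all.
    """
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        return next(reader, [])


def count_rows(filename):
    """Return the number of data rows (excluding the header) in a CSV file."""
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)
```

Calling `list_columns` on your own dataset file gives you the column list for your proposal, and `count_rows` gives a rough sense of whether there is enough data to answer each question.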
## Part 2: Project Implementation

Your project must implement a set of functions that use one or more datasets to answer each of three different questions. For each of your questions, you should implement two functions: one that returns numeric data that answers your question and one that produces a visualization of that data.

*Your implementation must include a file named `finalproject.py` that contains a main function that generates your three visualizations from the dataset(s) and saves them as files named `visualization1.png`, `visualization2.png`, and `visualization3.png`. Your main function should not take any inputs; all relevant information (e.g., filenames) should be hard-coded in the main function.*

Your code should also include helper functions and tester code as appropriate, and should follow the rules of good style. If your code depends on any external libraries that have not been used for prior assignments in this class, please make a note of those dependencies in the file-level docstring at the top of your `finalproject.py` file.

*Note:* you can save plots using the function `plot.savefig(filename)`.

#### Example Implementation

For example, the set of functions implemented for the Airbnb example might be:

* `price_satisfaction(filename)`: a function that takes a string containing the name of a file and returns a list of tuples of `float`s, where each tuple contains exactly two items: the price and the satisfaction.
* `correlation(l)`: a function that takes a list that is the output of `price_satisfaction` and returns the correlation between the price and the satisfaction, computed using Spearman's rank correlation.
* `plot_price_satisfaction(l)`: a function that takes a list that is the output of `price_satisfaction` and produces a scatterplot of the data.
* `host_listings(filename)`: a function that takes a string containing the name of a file and returns a dictionary where the keys are `host_id`s (`int`) and the values are a list of `room_id`s (`int`) associated with that host.
* `num_listings(dict)`: a function that takes as input a dictionary in the format returned by `host_listings`. It returns a list `l` where `l[i]` is the number of hosts that have exactly `i` listings.
* `plot_num_listings(l)`: a function that takes the output of `num_listings` and plots it.
* `room_prices(filename_list, roomtype)`: a function that takes two parameters: a list of filenames (each a `str`) and a room type. It returns a dictionary where the keys are `room_id`s (`int`) and the values are a list of the prices (`float`s) for that listing over time, from the oldest data to the most recent.
* `price_change(dict)`: a function that takes as input a dictionary in the format returned by `room_prices` and returns a `tuple` with three elements (max percentage change, starting price of that property, ending price of that property).
* `plot_price_change(dict)`: a function that takes as input a dictionary in the format returned by `room_prices` and produces a plot of the average price over time.

## Part 3: Write-Up

Write a short report (one or two pages is fine) describing the questions you posed and the results you found. You will probably want to include the visualizations you produced.

## Part 4: Feedback

Create a file named `feedback.txt` that answers the usual questions:

1. How long did you spend on this assignment?
2. Any comments or feedback? Things you found interesting? Things you found challenging? Things you found boring?

<a name="submission"></a>

## Submission

In addition to the project proposal, you are required to submit three files for your final project:

- `fp.zip`: a zip file that contains all of the implementation required above.
At a minimum, this must include your `finalproject.py` file and the dataset(s) you use in your project.
- `writeup.pdf`: a PDF file that contains your write-up for your final project.
- `feedback.txt`: a text file containing your feedback for this assignment.

These should be submitted using [submit.cs.pomona.edu](http://submit.cs.pomona.edu) as described in the general [submission instructions](../../submit.html).

## Grade Point Allocations

| Part      | Feature                                                | Value |
|-----------|--------------------------------------------------------|-------|
| Proposal  | Written proposal & meeting with instructor             | 3     |
|           |                                                        |       |
| Execution | Implements code to answer 3 questions                  | 45    |
| Execution | Implements visualization functions                     | 15    |
| Execution | Main function implements specification                 | 5     |
|           |                                                        |       |
| Style     | Docstrings accurate, relevant                          | 4     |
| Style     | Comments in code accurate, relevant                    | 4     |
| Style     | Good use of conditionals and loops                     | 4     |
| Style     | Good coding style, e.g., variable names and whitespace | 4     |
| Style     | Appropriate use of helper functions                    | 2     |
| Style     | Files submitted correctly                              | 2     |
|           |                                                        |       |
| Writeup   | Complete, accurate, and well written                   | 10    |
|           |                                                        |       |
| Feedback  | Completed feedback file submitted                      | 2     |
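
To make the Part 2 requirements more concrete, here is a minimal sketch of the numeric half of the Airbnb price and satisfaction example, written with only the standard library. It is one possible approach, not a required design. The column names `price` and `overall_satisfaction` are assumptions about the dataset, `ranks` is a hypothetical helper, and the matching visualization function is left out.

```python
import csv


def price_satisfaction(filename):
    """Return a list of (price, satisfaction) float tuples from a CSV file.

    Assumes hypothetical column names "price" and "overall_satisfaction";
    rows with a missing or non-numeric value are skipped.
    """
    pairs = []
    with open(filename, newline="") as f:
        for row in csv.DictReader(f):
            try:
                pairs.append((float(row["price"]),
                              float(row["overall_satisfaction"])))
            except (KeyError, ValueError, TypeError):
                continue
    return pairs


def ranks(values):
    """Return the 1-based rank of each value, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def correlation(l):
    """Return Spearman's rank correlation for a list of (x, y) tuples.

    Computed as the Pearson correlation of the ranks. Assumes at least
    two pairs and that neither variable is constant.
    """
    x_ranks = ranks([x for x, _ in l])
    y_ranks = ranks([y for _, y in l])
    n = len(l)
    mean_x = sum(x_ranks) / n
    mean_y = sum(y_ranks) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x_ranks, y_ranks))
    var_x = sum((a - mean_x) ** 2 for a in x_ranks)
    var_y = sum((b - mean_y) ** 2 for b in y_ranks)
    return cov / (var_x * var_y) ** 0.5
```

Keeping the file reading in `price_satisfaction` separate from the calculation in `correlation` means the same list of tuples can be passed to both the numeric function and the plotting function, and each piece can be tested on a small hand-made list.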