Review Sentiment Analysis

For this assignment we will build a Naive Bayes sentiment classifier that is trained on online reviews and predicts whether a given review is positive or negative. As we've done with previous assignments, we're going to guide you through implementing the Naive Bayes classifier by asking you to write a collection of functions.

Training

Our trained model will consist of two dictionaries, one representing the positive examples and one the negative examples. Each dictionary will map a word (the key) to p(word | label) (the value). To calculate these probabilities, all you'll need to do is iterate through each file and count how many times each word occurs, then divide by the total number of training examples, i.e., the total number of lines in the file: \[ p(word | positive) = \frac{\mbox{how many } positive \mbox{ examples contained } word}{\mbox{the total number of } positive \mbox{ examples}} \]

We have set up the data so this really comes down to calculating: \[ p(word | positive) = \frac{\mbox{how many times } word \mbox{ occurred in the } positive \mbox{ file}}{\mbox{ the number of lines in the } positive \mbox{ file}} \]
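As a rough sketch of this calculation (not the decomposition into separate functions that the assignment asks for below, and with an illustrative name `word_probabilities`), the per-word probabilities for one file could be computed like this, assuming each element of `lines` is one preprocessed example:

```python
def word_probabilities(lines):
    # Count how many times each word occurs across all examples.
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    # Divide each count by the number of examples (lines) to get
    # p(word | label) for the file these examples came from.
    num_examples = len(lines)
    return {word: count / num_examples for word, count in counts.items()}
```

For example, with two examples `["a b", "a"]`, the word `a` occurs in both lines, so its probability is 1.0, while `b` gets 0.5.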

Classifying

Once we have our two dictionaries of probabilities, we'll be ready to classify new examples. Given a new review to classify (\(w_1, w_2, ..., w_m\)) we'll calculate the probability of that review based on the model as: \[ p(label | w_1, w_2, ..., w_m) \approx p(w_1|label) * p(w_2|label) * ... * p(w_m|label) = \prod_{i=1}^m p(w_i|label) \]

This means that, for each word in the review, we'll multiply together the word probabilities given the label. We'll do this for both classes (positive and negative) and then classify the review with the label that gives the higher probability.
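The comparison at the heart of the classifier can be sketched as follows. This is a simplified version with illustrative names (`score`, `pick_label`) that assumes every word appears in both models; the assignment functions below will need to handle words that are missing from a model.

```python
import math

def score(model, review):
    # Product of p(word | label) for every word in the review.
    # Assumes every word is present in the model dictionary.
    return math.prod(model[word] for word in review.split())

def pick_label(pos_model, neg_model, review):
    # Ties go to positive, matching the classify spec below.
    if score(pos_model, review) >= score(neg_model, review):
        return "positive"
    return "negative"
```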

Dataset

For this assignment, I've put together a collection of reviews from rateitall. The reviews come from a variety of domains including movies, music, books, and politicians. The original data have ratings from 1 to 5, but I have simplified these scores, keeping only the 1s and 5s: reviews with a score of 5 are "positive" and those with a 1 are "negative".

To start on the assignment, download the starter package and copy the files to a week6 folder. The dataset contains three pairs of files. For each pair, the .positive file contains all of the positive reviews and the .negative all of the negative reviews. To help in testing and debugging, I've provided simple.positive and simple.negative, which only contain three examples per file and will be useful for illustrating how different functions work. train.positive and train.negative contain the text examples that you should use to train your model. If you open up train.positive you'll see the first few examples are:

  perhaps the perfect action thriller movie . funny , suspenseful dramatic packed
  has it all

  a well constructed thriller .

  my favorite action movie yet !

  i love this movie ! bought it on dvd when came out . it's better than the
  original , and jamie lee lindsey were great
  

To make life simpler for you, I've already preprocessed the data so that all you need to do is count word occurrences. Specifically, the text has been lowercased and the punctuation has been separated from the surrounding words by spaces, so splitting each line on whitespace gives the individual words.

To evaluate our model, we'll use test.positive and test.negative.

Naive Bayes

Implement the following functions in bayes.py that will build up our Naive Bayes classifier functionality.

Training

  1. 3 points. Write a function called get_file_counts that takes as input a filename and returns a Python dictionary with the number of times each word occurred in that file.

    Each line in the file will contain an example. I've already done all of the preprocessing for you, so just use the split method to split a line up into its individual words. For example:

        >>> get_file_counts("simple.positive")
        {'i': 3, 'loved': 3, 'it': 2, 'that': 2, 'movie': 1, 'hated': 1}
        
  2. 2 points. Write a function called counts_to_probs that takes a dictionary and a number and generates a new dictionary with the same keys where each value has been divided by the input number. For example:
        >>> counts = get_file_counts("simple.positive")
        >>> counts_to_probs(counts, 3)
        {'i': 1.0, 'loved': 1.0, 'it': 0.6666666666666666, 'that': 0.6666666666666666, 'movie': 0.3333333333333333, 'hated': 0.3333333333333333}
          

    Note that if you call this function with the word counts from a file and the number of lines in the file, you'll get back the probabilities of each word (which is what we need!).

  3. 1 point. Write a function called train_model that takes as input a filename containing examples and returns a dictionary with the word probabilities (\(p(\mathit{word}|\mathit{label})\)). This should be a very short function that mostly just uses the previous two functions.

Classifying

  1. 3 points. Write a function called get_probability that takes as input two parameters: a dictionary of word probabilities and a string (representing a review). It should return the probability of that review, computed by multiplying together the probabilities of each of the words in the review. A few notes: lowercase the review before splitting it into words (the model dictionaries contain only lowercase words), and if a word does not appear in the dictionary at all, assign it the constant probability 1/11000.

    For example:

            >>> pos_model = train_model("simple.positive")
            >>> get_probability(pos_model, "I hated that class")
            2.02020202020202e-05
          

    The answer you get is:

        p(i | pos) * p(hated | pos) * p(that | pos) * p(class | pos) =
            1.0    *      0.333     *      0.666    *     0.00009    = 0.00002
          

    The first three words are found in the model; the fourth is not, so it is assigned the constant 1/11000 probability.
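You can sanity-check this arithmetic directly; the exact value shown above falls out of the product below (using 1/11000 for the unseen word):

```python
# Check the worked example: p(i|pos) = 1.0, p(hated|pos) = 1/3,
# p(that|pos) = 2/3, and 'class' is unseen, so it gets 1/11000.
probability = 1.0 * (1 / 3) * (2 / 3) * (1 / 11000)
print(probability)  # approximately 2.0202e-05
```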

  2. 2 points. Write a function called classify that takes three inputs: a string representing a review, the positive model (a dictionary of word probabilities), and the negative model (another dictionary of word probabilities). The function should return "positive" or "negative" depending on which model gives the higher probability for the review. Ties should go to positive. Again, most of the work should be done by the functions you defined earlier.
  3. 3 points. To make it easy to play with our model, write an interactive function called sentiment_analyzer that takes two files as input: a positive examples file and a negative examples file, in that order. The function should train a positive and negative model using these files and then repeatedly ask the user to enter a sentence and then output the classification of that sentence (as positive or negative). A blank line/sentence should terminate the function. For example:
            >>> sentiment_analyzer("train.positive", "train.negative")
            Blank line terminates.
            Enter a sentence: I like pizza
            positive
            Enter a sentence: I hate pizza
            negative
            Enter a sentence: I slipped on a banana
            positive
            Enter a sentence: I slipped on a bad banana
            negative
            Enter a sentence: computer science
            positive
            Enter a sentence:
            >>>
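The interactive loop pattern in sentiment_analyzer can be sketched as below. The names `interaction_loop` and `classify_one` are illustrative, not part of the required interface; the `read` and `write` parameters default to the built-in `input` and `print` (passing them in just makes the loop easy to test).

```python
def interaction_loop(classify_one, read=input, write=print):
    # Prompt repeatedly; a blank line terminates the loop.
    write("Blank line terminates.")
    while True:
        sentence = read("Enter a sentence: ")
        if sentence == "":
            break
        write(classify_one(sentence))
```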
          

Evaluation

Now that we have a working model, we should figure out how good it is. First, we can use a quantitative measure. To do this, we're going to classify our test examples and calculate what proportion we get right (called the accuracy).

  1. 4 points. Write a function called get_accuracy that takes four files as input in this order: the positive test file, the negative test file, the positive training file, and the negative training file.

    The function should train both the positive and negative models, classify all of the test examples (both positive and negative), and keep track of the model's accuracy. The function should print out three scores: the accuracy on the positive test examples, the accuracy on the negative test examples, and the accuracy on all of the test examples. For example (I've hidden the actual values printed out since I want you to be surprised when you get your code running!):

            >>> get_accuracy("test.positive", "test.negative", "train.positive", "train.negative")
            Positive accuracy: 0.#####
            Negative accuracy: 0.#####
            Total accuracy: 0.#####
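The accuracy calculation itself is just the proportion of examples classified correctly. A small sketch, with illustrative names (`predictions` and `true_labels` are parallel lists of "positive"/"negative" strings, not part of the required get_accuracy signature):

```python
def accuracy(predictions, true_labels):
    # Proportion of predictions that match the true labels.
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    return correct / len(true_labels)
```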
          

    Advice:

  2. 4 points. Include 1-2 short paragraphs (less than half a page, though), as either comments or a triple quoted string, at the end of your file evaluating the quality of your Naive Bayes model. Your discussion must include:

Submitting Your Work

Upload your Python file to Gradescope, but not the positive/negative training files.