For this assignment we will build a Naive Bayes text sentiment classifier: it is trained on online reviews and predicts whether a given review is positive or negative. As we've done with previous assignments, we're going to guide you through implementing the Naive Bayes classifier by asking you to write a collection of functions.
Our trained model will consist of two dictionaries, one representing positive examples and one representing negative examples. In each dictionary the keys are words and the corresponding values are \(p(word | label)\). To calculate these, all you'll need to do is iterate through each file, count how many times a word occurs, and then divide by the total number of training examples (that is, the total number of lines in the file), i.e.: \[ p(word | positive) = \frac{\mbox{how many } positive \mbox{ examples contained } word}{\mbox{the total number of } positive \mbox{ examples}} \]
We have set up the data so this really comes down to calculating: \[ p(word | positive) = \frac{\mbox{how many times } word \mbox{ occurred in the } positive \mbox{ file}}{\mbox{ the number of lines in the } positive \mbox{ file}} \]
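For example, in the small simple.positive file used in the examples below, the word "loved" occurs in all three positive examples while "movie" occurs in just one, so \( p(loved | positive) = 3/3 = 1.0 \) and \( p(movie | positive) = 1/3 \approx 0.33 \).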
Once we have our two dictionaries of probabilities, we'll be ready to classify new examples. Given a new review to classify (\(w_1, w_2, ..., w_m\)) we'll calculate the probability of that review based on the model as: \[ p(label | w_1, w_2, ..., w_m) \approx p(w_1|label) * p(w_2|label) * ... * p(w_m|label) = \prod_{i=1}^m p(w_i|label) \]
This means that for each word in the review we multiply in the probability of that word given the label. We'll do this for both classes (positive and negative) and then classify the review as the label with the higher probability.
For this assignment, I've put together a collection of reviews from rateitall. The reviews come from a variety of domains including movies, music, books, and politicians. The original data have ratings from 1 to 5, but I have simplified things by keeping only the 1s and 5s: reviews with a score of 5 are "positive" and reviews with a score of 1 are "negative".
To start on the assignment, download the starter package and copy the files to a week6 folder. The dataset contains three pairs of files. For each pair, the .positive file contains all of the positive reviews and the .negative file contains all of the negative reviews. To help in testing and debugging, I've provided simple.positive and simple.negative, which only contain three examples per file and will be useful for illustrating how different functions work. train.positive and train.negative contain the text examples that you should use to train your model. If you open up train.positive you'll see the first few examples are:
perhaps the perfect action thriller movie . funny , suspenseful dramatic packed has it all
a well constructed thriller . my favorite action movie yet !
i love this movie ! bought it on dvd when came out . it's better than the original , and jamie lee lindsey were great
To make life simpler for you, I've already preprocessed the data so that all you need to do is count word occurrences: the text is all lowercase and punctuation has been separated from the words by spaces.
To evaluate our model, we'll use test.positive and test.negative.
Implement the following functions in bayes.py that will build up our Naive Bayes classifier functionality.
get_file_counts that takes as input a filename and returns a Python dictionary with the number of times each word occurred in that file.
Each line in the file will contain an example. I've already done all of the preprocessing for you, so just use the split method to split a line up into its individual words. For example:
>>> get_file_counts("simple.positive")
{'i': 3, 'loved': 3, 'it': 2, 'that': 2, 'movie': 1, 'hated': 1}
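If you get stuck, here is a minimal sketch of one way get_file_counts could look (just a sketch; your own version may differ):

def get_file_counts(filename):
    # build a dictionary mapping each word to how many times it occurs in the file
    counts = {}
    with open(filename) as infile:
        for line in infile:            # each line is one preprocessed example
            for word in line.split():  # split on whitespace into individual words
                counts[word] = counts.get(word, 0) + 1
    return counts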
counts_to_probs that takes a dictionary and a number and generates a new dictionary with the same keys where each value has been divided by the input number. For example:
>>> counts = get_file_counts("simple.positive")
>>> counts_to_probs(counts, 3)
{'i': 1.0, 'loved': 1.0, 'it': 0.66, 'that': 0.66, 'movie': 0.33, 'hated': 0.33}
(The values above are shortened for readability; your actual output will show more digits, e.g. 0.6666666666666666 instead of 0.66.)
Note that if you call this function with the word counts from a file and the number of lines in the file, you'll get back the probabilities of each word (which is what we need!).
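Following the same pattern, a sketch of counts_to_probs:

def counts_to_probs(counts, num):
    # divide every count by num; when num is the number of training examples
    # this turns raw counts into p(word | label)
    probs = {}
    for word in counts:
        probs[word] = counts[word] / num
    return probs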
train_model that takes as input a filename containing examples and returns a dictionary with the word probabilities (\(p(\mathit{word}|\mathit{label})\)). This should be a very short function that mostly just uses the previous two functions.
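As a rough sketch, train_model can be just a few lines once the two functions above exist; the only extra work in this sketch is counting the lines in the file to get the number of examples:

def train_model(filename):
    # p(word | label) = how many times the word occurred in the file
    #                   divided by the number of examples (lines) in the file
    counts = get_file_counts(filename)
    with open(filename) as infile:
        num_examples = len(infile.readlines())  # one example per line
    return counts_to_probs(counts, num_examples)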
get_probability that takes as input two parameters: a dictionary of word probabilities and a string (representing a review). It should return the probability of that review by multiplying together the probabilities of each of the words in the review. A few notes: you'll need to break the review up into its individual words (again, using split), and you'll want to lowercase the review first since all of the training data is lowercase (notice that the capital "I" in the example below still matches the lowercase 'i' in the model). If a word doesn't occur in the dictionary of word probabilities, use the constant 1/11000 as its probability. For example:
>>> pos_model = train_model("simple.positive")
>>> get_probability(pos_model, "I hated that class")
2.02020202020202e-05
The answer you get is:
p(i | pos) * p(hated | pos) * p(that | pos) * p(class | pos) = 1.0 * 0.333 * 0.666 * 0.00009 ≈ 0.00002
The first three words are found in the model and the fourth is not, so it is assigned the constant 1/11000 probability.
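A sketch of get_probability, assuming the 1/11000 constant for unseen words and the lowercasing mentioned in the notes above:

def get_probability(probs, review):
    # multiply together p(word | label) for every word in the review;
    # words that aren't in the model get the constant probability 1/11000
    total = 1.0
    for word in review.lower().split():
        if word in probs:
            total *= probs[word]
        else:
            total *= 1 / 11000
    return total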
classify that takes three inputs: a string representing a review, the positive model (a dictionary of word probabilities), and the negative model (another dictionary of word probabilities). The function should return "positive" or "negative" depending on which model gives the review the higher probability. Ties should go to positive. Again, most of the work should be done by the functions you defined earlier.
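A sketch of classify built on get_probability:

def classify(review, pos_probs, neg_probs):
    # whichever model gives the review the higher probability wins
    pos_p = get_probability(pos_probs, review)
    neg_p = get_probability(neg_probs, review)
    if pos_p >= neg_p:   # ties go to positive
        return "positive"
    else:
        return "negative"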
sentiment_analyzer that takes two files as input: a positive examples file and a negative examples file, in that order. The function should train a positive and a negative model using these files and then repeatedly ask the user to enter a sentence, outputting the classification of that sentence (as positive or negative). A blank line/sentence should terminate the function. For example:
>>> sentiment_analyzer("train.positive", "train.negative")
Blank line terminates.
Enter a sentence: I like pizza
positive
Enter a sentence: I hate pizza
negative
Enter a sentence: I slipped on a banana
positive
Enter a sentence: I slipped on a bad banana
negative
Enter a sentence: computer science
positive
Enter a sentence:
>>>
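A sketch of sentiment_analyzer; the prompt strings here are simply copied from the example output above:

def sentiment_analyzer(pos_filename, neg_filename):
    # train both models, then classify sentences until the user enters a blank line
    pos_model = train_model(pos_filename)
    neg_model = train_model(neg_filename)
    print("Blank line terminates.")
    sentence = input("Enter a sentence: ")
    while sentence != "":
        print(classify(sentence, pos_model, neg_model))
        sentence = input("Enter a sentence: ")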
Now that we have a working model, we should figure out how good it is. First, we can use a quantitative measure. To do this, we're going to classify our test examples and calculate what proportion we get right (called the accuracy).
get_accuracy that takes four files as input in this order: the positive test file, the negative test file, the positive training file, and the negative training file.
The function should train the model (both the positive and negative word probabilities) and then classify all of the test examples (both positive and negative), keeping track of how many the model gets right. The function should print out three scores: the accuracy on the positive test examples, the accuracy on the negative test examples, and the accuracy on all of the test examples. For example (I've hidden the actual values printed out since I want you to be surprised when you get your code running!):
>>> get_accuracy("test.positive", "test.negative", "train.positive", "train.negative")
Positive accuracy: 0.#####
Negative accuracy: 0.#####
Total accuracy: 0.#####
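One possible sketch of get_accuracy. The count_correct helper is a made-up name for this sketch; it classifies every line in a file and reports how many match the expected label:

def count_correct(filename, label, pos_model, neg_model):
    # (hypothetical helper) count how many examples in the file classify() labels correctly
    correct = 0
    total = 0
    with open(filename) as infile:
        for line in infile:
            if classify(line, pos_model, neg_model) == label:
                correct += 1
            total += 1
    return correct, total

def get_accuracy(pos_test, neg_test, pos_train, neg_train):
    # train on the training files, then measure accuracy on the test files
    pos_model = train_model(pos_train)
    neg_model = train_model(neg_train)
    pos_correct, pos_total = count_correct(pos_test, "positive", pos_model, neg_model)
    neg_correct, neg_total = count_correct(neg_test, "negative", pos_model, neg_model)
    print("Positive accuracy:", pos_correct / pos_total)
    print("Negative accuracy:", neg_correct / neg_total)
    print("Total accuracy:", (pos_correct + neg_correct) / (pos_total + neg_total))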
Advice:
Use print() to help you understand what your code is doing, but remove any prints before submitting your work.
A good way to check your code is to call get_accuracy with simple.positive and simple.negative both as the test and training sets. You should see perfect positive accuracy and 2/3 negative accuracy. It's not a good way to evaluate a model, but it is a good way to make sure your code works!
When you're done, include the output of get_accuracy on the test examples and a discussion of these results. Upload your python file to Gradescope, but not the positive/negative training files.