Guided Project: Analyzing movie reviews
1: Movie Reviews
In this project, you'll work in a Jupyter notebook and analyze data on movie review scores. By the end, you'll have a notebook that you can add to your portfolio or build on. If you need help at any point, you can consult our solution notebook here.
The dataset is stored in the fandango_score_comparison.csv file. It contains information on how major movie review services rated movies. The data originally came from FiveThirtyEight.
Here are the first few rows of the data, in CSV format:
Each row represents a single movie. Each column contains information about how the online movie review services RottenTomatoes, Metacritic, IMDB, and Fandango rated the movie. The dataset was put together to help detect bias in the movie review sites. Each of these sites has 2 types of score -- User scores, which aggregate user reviews, and Critic scores, which aggregate professional critical reviews of the movie. Each service puts its ratings on a different scale:
- RottenTomatoes -- 0-100, in increments of 1.
- Metacritic -- 0-100, in increments of 1.
- IMDB -- 0-10, in increments of .1.
- Fandango -- 0-5, in increments of .5.
Typically, the primary score shown by the sites will be the Critic score. Here are descriptions of some of the relevant columns in the dataset:
- FILM -- the name of the movie.
- RottenTomatoes -- the RottenTomatoes (RT) critic score.
- RottenTomatoes_User -- the RT user score.
- Metacritic -- the Metacritic critic score.
- Metacritic_User -- the Metacritic user score.
- IMDB -- the IMDB score given to the movie.
- Fandango_Stars -- the number of stars Fandango gave the movie.
To make it easier to compare scores across services, the columns were normalized so their scale and rounding matched the Fandango ratings. Any column with the suffix _norm is the corresponding column changed to a 0-5 scale. For example, RT_norm takes the RottenTomatoes column and turns it from a 0-100 scale into a 0-5 scale. Any column with the suffix _round is the rounded version of another column. For example, RT_user_norm_round rounds the RT_user_norm column to the nearest .5.
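The normalization and rounding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper names to_five_point and round_to_half are hypothetical, the input scores are made up, and the exact rounding rule used to build the dataset is an assumption.

```python
def to_five_point(score, scale_max):
    """Rescale a score from a 0..scale_max scale to a 0..5 scale."""
    return score / scale_max * 5


def round_to_half(value):
    """Round to the nearest 0.5 by rounding twice the value to an integer."""
    # Python's round() sends exact ties to the nearest even integer,
    # which may differ from the rule used to build the dataset.
    return round(value * 2) / 2


# A made-up RottenTomatoes critic score of 74 on the 0-100 scale.
rt_norm = to_five_point(74, 100)
rt_norm_round = round_to_half(rt_norm)
print(rt_norm, rt_norm_round)

# A made-up IMDB score of 7.8 on the 0-10 scale.
imdb_norm_round = round_to_half(to_five_point(7.8, 10))
print(imdb_norm_round)
```

Once every service is on the same 0-5 scale with .5 steps, the columns can be compared directly with Fandango_Stars.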
Instructions
- Read the dataset into a DataFrame called movies using Pandas.
- You can output a DataFrame as a table by typing just the variable name containing the DataFrame in the last line of a Jupyter cell. Do this with movies and look over the table.
- If you're unfamiliar with RottenTomatoes, Metacritic, IMDB, or Fandango, visit the websites to get a better handle on their review methodology.
import pandas
movies = pandas.read_csv("fandango_score_comparison.csv")
movies  # the last line of a cell is displayed as a table
2: Histograms
Now that you've read the dataset in, you can do some statistical exploration of the ratings columns. We'll primarily focus on the Metacritic_norm_round and the Fandango_Stars columns, which will let you see how Fandango and Metacritic differ in terms of review scores.
Instructions
- Enable plotting in Jupyter notebook with import matplotlib.pyplot as plt and run the magic %matplotlib inline.
- Create a histogram of the Fandango_Stars column.
- Create a histogram of the Metacritic_norm_round column.
- Look critically at both histograms, and write up any differences you see in a markdown cell.
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(movies["Fandango_Stars"])
plt.show()  # finish the first figure so the two histograms are not drawn on top of each other
plt.hist(movies["Metacritic_norm_round"])
plt.show()
3: Mean, Median, And Standard Deviation
In the last screen, you may have noticed some differences between the Fandango and Metacritic scores. Metrics we've covered, including the mean, median, and standard deviation, allow you to quantify these differences. You can apply these metrics to the Fandango_Stars and Metacritic_norm_round columns to figure out how different they are.
Instructions
- Calculate the mean of both Fandango_Stars and Metacritic_norm_round.
- Calculate the median of both Fandango_Stars and Metacritic_norm_round.
- Calculate the standard deviation of both Fandango_Stars and Metacritic_norm_round. You can use the numpy.std function to find this.
- Print out all the values and look over them.
- Look at the review methodologies for Metacritic and Fandango. You can find the methodologies on their websites, or by using Google. Do you see any major differences? Write them up in a markdown cell.
- Write up the differences in numbers in a markdown cell, including the following:
  - Why would the median for Metacritic_norm_round be lower than the mean, but the median for Fandango_Stars be higher than the mean? Recall that the mean is usually larger than the median when there are a few large values in the data, and lower when there are a few small values.
  - Why would the standard deviation for Fandango_Stars be much lower than the standard deviation for Metacritic_norm_round?
  - Why would the mean for Fandango_Stars be much higher than the mean for Metacritic_norm_round?
import numpy
f_mean = movies["Fandango_Stars"].mean()
m_mean = movies["Metacritic_norm_round"].mean()
# Series.std() uses the sample formula (ddof=1), while numpy.std defaults to ddof=0.
f_std = movies["Fandango_Stars"].std()
m_std = movies["Metacritic_norm_round"].std()
f_median = movies["Fandango_Stars"].median()
m_median = movies["Metacritic_norm_round"].median()
print(f_mean)
print(m_mean)
print(f_std)
print(m_std)
print(f_median)
print(m_median)
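Two points from this screen are easier to see on a tiny example: how a few extreme values pull the mean away from the median, and why numpy.std and the pandas .std() method can report slightly different numbers. The sketch below uses the standard library's statistics module as a stand-in, and the ratings are made up for illustration only.

```python
import statistics

# Made-up ratings on a 0-5 scale, for illustration only.
few_low_values = [4.0, 4.0, 4.5, 4.5, 4.5, 5.0, 1.0]   # mostly high, one low outlier
few_high_values = [1.0, 1.5, 1.5, 2.0, 2.0, 2.0, 5.0]  # mostly low, one high outlier

for name, values in [("few low", few_low_values), ("few high", few_high_values)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    # A low outlier drags the mean below the median; a high one lifts it above.
    print(name, "mean:", round(mean, 2), "median:", median)

# Population standard deviation divides by n (numpy.std's default, ddof=0).
pop_std = statistics.pstdev(few_low_values)
# Sample standard deviation divides by n - 1 (pandas Series.std's default, ddof=1).
sample_std = statistics.stdev(few_low_values)
print("population std:", round(pop_std, 3), "sample std:", round(sample_std, 3))
```

The two standard deviations differ only in the divisor, so the gap shrinks as the number of rows grows. The comparison between Fandango and Metacritic is not affected as long as the same convention is used for both columns.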
4: Scatter Plots
We know the ratings tend to differ, but we don't know which movies tend to be the largest outliers. You can find this by making a scatterplot, then looking at which movies are far away from the others.
You can also subtract the Fandango_Stars column from the Metacritic_norm_round column, take the absolute value, and sort movies based on the difference to find the movies with the largest differences between their Metacritic and Fandango ratings.
Instructions
- Make a scatterplot that compares the Fandango_Stars column to the Metacritic_norm_round column.
- Several movies appear to have low ratings in Metacritic and high ratings in Fandango, or vice versa. We can explore this further by finding the differences between the columns.
  - Subtract the Fandango_Stars column from the Metacritic_norm_round column, and assign to a new column, fm_diff, in movies.
  - Assign the absolute value of fm_diff to fm_diff. This will ensure that we don't only look at cases where Metacritic_norm_round is greater than Fandango_Stars.
    - You can calculate absolute values with the absolute function in NumPy.
  - Sort movies based on the fm_diff column, in descending order.
  - Print out the top 5 movies with the biggest differences between Fandango_Stars and Metacritic_norm_round.
plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
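The difference-and-sort steps listed above can be sketched on a small stand-in table. The film names and scores below are hypothetical and are not taken from the dataset. Series.abs is used here in place of numpy.absolute, and both give the same result.

```python
import pandas as pd

# Hypothetical films and scores, for illustration only (not from the dataset).
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "Fandango_Stars": [4.5, 5.0, 3.0],
    "Metacritic_norm_round": [4.0, 1.5, 3.5],
})

# Absolute difference, so the direction of the gap does not matter.
toy["fm_diff"] = (toy["Metacritic_norm_round"] - toy["Fandango_Stars"]).abs()

# Largest gaps first.
largest_gaps = toy.sort_values("fm_diff", ascending=False)
print(largest_gaps.head(2))
```

On the real data, the same pattern applies to movies, with head(5) to list the five largest gaps.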
5: Correlations
Let's see what the correlation coefficient between Fandango_Stars and Metacritic_norm_round is. This will help you determine if Fandango consistently has higher scores than Metacritic, or if only a few movies were assigned higher ratings.
You can then create a linear regression to see what the predicted Fandango score would be based on the Metacritic score.
Instructions
- Calculate the r-value measuring the correlation between Fandango_Stars and Metacritic_norm_round using the scipy.stats.pearsonr function.
- The correlation is actually fairly low. Write up a markdown cell that discusses what this might mean.
- Use the scipy.stats.linregress function to create a linear regression line with Metacritic_norm_round as the x-values and Fandango_Stars as the y-values.
- Predict what a movie that got a 3.0 in Metacritic would get on Fandango using the line.
# From the previous screen: absolute difference between the two ratings, largest first.
movies["fm_diff"] = numpy.abs(movies["Metacritic_norm_round"] - movies["Fandango_Stars"])
# DataFrame.sort was removed from pandas; sort_values is its replacement.
movies.sort_values("fm_diff", ascending=False).head(5)
from scipy.stats import pearsonr
r_value, p_value = pearsonr(movies["Fandango_Stars"], movies["Metacritic_norm_round"])
r_value
from scipy.stats import linregress
slope, intercept, r_value, p_value, stderr_slope = linregress(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
# Predicted Fandango score for a movie rated 3.0 on Metacritic.
pred = 3 * slope + intercept
pred
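To make the numbers returned by pearsonr and linregress less of a black box, the sketch below computes the correlation coefficient and the least-squares line from their textbook definitions in plain Python. The x and y values are made up for illustration; they stand in for Metacritic_norm_round and Fandango_Stars and are not taken from the dataset.

```python
from math import sqrt

# Made-up scores on a 0-5 scale, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # stand-in for Metacritic_norm_round
y = [3.0, 3.5, 3.5, 4.0, 4.5]   # stand-in for Fandango_Stars

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-deviations from the means.
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
ss_yy = sum((yi - mean_y) ** 2 for yi in y)

# Pearson correlation coefficient.
r = ss_xy / sqrt(ss_xx * ss_yy)

# Least-squares regression line y = slope * x + intercept.
slope = ss_xy / ss_xx
intercept = mean_y - slope * mean_x

# Prediction for an x value of 3.0.
pred_3 = slope * 3.0 + intercept
print(r, slope, intercept, pred_3)
```

Note that r measures how tightly the points follow a line, while the intercept captures a constant offset. A service could score every movie higher than another and still show a high or a low correlation, so the two numbers answer different questions.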
6: Finding Residuals
In the last screen, you created a linear regression relating Metacritic_norm_round to Fandango_Stars. You can create a residual plot to better visualize how the line relates to the existing datapoints. This can help you see if two variables are linearly related or not.
Instructions
- Predict what a movie that got a 4.0 in Metacritic would get on Fandango using the line from the last screen.
- Make a scatter plot using the scatter function in matplotlib.pyplot.
- On top of the scatter plot, use the plot function in matplotlib.pyplot to plot a line using the predicted values for 3.0 and 4.0.
  - Set up the x values as the list [3.0, 4.0].
  - The y values should be a list with the corresponding predictions.
  - Pass in both x and y to plot to create a line.
- Show the plot.
import random
random.seed(1)
# Random x values between 0 and 5, with predictions from the regression line.
random_100 = [random.randint(0, 5) for _ in range(100)]
random_100_x = numpy.array(random_100)
random_100_y = random_100_x * slope + intercept
fig = plt.figure(figsize=(16, 16))
ax = fig.add_subplot(111)
ax.plot(random_100_x, random_100_y, c='r', label='Prediction')
ax.scatter(movies['Metacritic_norm_round'], movies['Fandango_Stars'], c='b', label='Real')
plt.legend(loc='upper left')
plt.xlabel('Metacritic_norm_round')
plt.ylabel('Fandango_Stars')
ax.set_xlim([-0.5, 5.5])
ax.set_ylim([-0.5, 5.5])
# seaborn was never imported and no longer exposes sns.plt; use pyplot directly.
plt.show()
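The screen title mentions residuals, which are simply the actual values minus the values predicted by the line. A minimal sketch of that calculation follows, using made-up scores rather than the dataset. It fits the line with numpy.polyfit instead of scipy.stats.linregress to keep the example short; for a straight line both give the same least-squares fit.

```python
import numpy as np

# Made-up scores on a 0-5 scale, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # stand-in for Metacritic_norm_round
y = np.array([3.0, 3.5, 3.5, 4.0, 4.5])   # stand-in for Fandango_Stars

# Degree-1 least-squares fit; polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Residual = actual value minus the value predicted by the line.
predicted = slope * x + intercept
residuals = y - predicted
print(residuals)
```

Plotting residuals against x (for example with plt.scatter) and seeing no clear pattern around zero suggests a linear relationship is a reasonable description. A curve or a funnel shape suggests it is not.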
7: Next Steps
That's it for the guided steps! We recommend exploring the data more on your own.
Here are some potential next steps:
- Explore the other rating services, IMDB and RottenTomatoes.
- See how they differ from each other.
- See how they differ from Fandango.
- See how user scores differ from critic scores.
- Acquire more recent review data, and see if the pattern of Fandango inflating reviews persists.
- Dig more into why certain movies had their scores inflated more than others.
We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.
You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.
Reposted from: https://my.oschina.net/Bettyty/blog/749694