Overfitting

1: Introduction

While exploring regression, we briefly mentioned overfitting and the problems it can cause. In this mission, we'll explore how to identify overfitting and what you can do to avoid it. To explore overfitting, we'll use a dataset on cars that contains 7 numerical features that could have an effect on a car's fuel efficiency:

  • cylinders -- the number of cylinders in the engine.
  • displacement -- the displacement of the engine.
  • horsepower -- the horsepower of the engine.
  • weight -- the weight of the car.
  • acceleration -- the acceleration of the car.
  • model year -- the year that car model was released (e.g. 70 corresponds to 1970).
  • origin -- where the car was manufactured (1 if North America, 2 if Europe, 3 if Asia).

The mpg column is our target column and is the one we want to predict using the other features.

The dataset is hosted by the University of California, Irvine on their machine learning repository. You'll notice that the Data Folder contains a few different files. We'll be working with auto-mpg.data, which omits the 8 rows containing missing values for fuel efficiency (the mpg column).

The code below imports pandas, reads the data into a DataFrame, and cleans up some messy values. Explore the dataset to become more familiar with it.

Instructions

This step is a demo. Play around with code or advance to the next step.

import pandas as pd
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
cars = pd.read_csv("auto-mpg.data", sep=r"\s+", names=columns)
# Drop rows with missing horsepower values, then convert the column to floats.
# .copy() avoids pandas' SettingWithCopyWarning when assigning below.
filtered_cars = cars[cars['horsepower'] != '?'].copy()
filtered_cars['horsepower'] = filtered_cars['horsepower'].astype('float')
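
As a quick starting point for that exploration, a few standard pandas calls (a minimal sketch; nothing here is required by the mission) will show the shape and summary statistics of the cleaned data:

print(filtered_cars.head())      # first few rows
print(filtered_cars.shape)       # (rows, columns)
print(filtered_cars.describe())  # summary statistics for the numeric columns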


2: Bias And Variance

At the heart of understanding overfitting is understanding bias and variance. Bias and variance make up the 2 observable sources of error in a model that we can indirectly control.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will have high bias. The error rate will be high, since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features for each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance.

In an ideal world, we want low bias and low variance but in reality, there's always a tradeoff.
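
To make the tradeoff concrete, here's a minimal, self-contained sketch on synthetic data (not the cars dataset): a degree-1 polynomial underfits the underlying curve (high bias), while a degree-15 polynomial chases the noise and drives its training error artificially low (high variance).

import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 30)
# True signal plus random noise.
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.3, 30)

for degree in [1, 15]:
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    train_mse = np.mean((y - predictions) ** 2)
    print("degree", degree, "training MSE:", round(train_mse, 4))

The degree-15 model's low training error doesn't mean it generalizes better; it has memorized the noise in this particular sample.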

3: Bias-Variance Tradeoff

We've discussed before how overfitting generally happens when a model performs well on a training set but doesn't generalize well to new data. A key nuance here is that you should think of overfitting as a relative term: between any 2 models, one will overfit more than the other.

Understanding the bias variance tradeoff is critical to understanding overfitting. Every process has some amount of inherent noise that's unobservable. Overfit models tend to capture the noise as well as the signal in a dataset.

Scott Fortmann-Roe's blog post on the bias-variance tradeoff has a wonderful image that describes this tradeoff:

[Image: bias-variance tradeoff diagram from Scott Fortmann-Roe's blog post]

We can approximate the bias of a model by training a few different models from the same class (linear regression in this case) using different features on the same dataset and calculating their error scores. For regression, we can use mean absolute error, mean squared error, or R-squared.

We can calculate the variance of the predicted values for each model we train and we'll observe an increase in variance as we build more complex, multivariate models.

While an extremely simple, univariate linear regression model will underfit, an extremely complicated, multivariate linear regression model will overfit. Depending on the problem you're working on, there's a happy middle ground that will help you construct reliable and useful predictive models.

Let's first create a function that we can use for training the model and computing the bias and variance values and use it to train some simple, univariate models.

Instructions

  • Create a function named train_and_test that:

    • Takes in a list of column names as the sole parameter (cols),
    • Trains a linear regression model using:
      • The columns in cols as the features,
      • The mpg column as the target variable.
    • Uses the trained model to make predictions using the same input it was trained on,
    • Computes the variance of the predicted values and the mean squared error between the predicted values and the actual label (mpg column).
    • Returns the mean squared error value followed by the variance (e.g. return(mse, variance)).
  • Use the train_and_test function to train a model using only the cylinders column. Assign the resulting mean squared error value and variance to cyl_mse and cyl_var.

  • Use the train_and_test function to train a model using only the weight column. Assign the resulting mean squared error value and variance to weight_mse and weight_var.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def train_and_test(cols):
    # Split into features & target.
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    # Fit model.
    lr = LinearRegression()
    lr.fit(features, target)
    # Make predictions on training set.
    predictions = lr.predict(features)
    # Compute MSE and Variance.
    mse = mean_squared_error(filtered_cars["mpg"], predictions)
    variance = np.var(predictions)
    return(mse, variance)
    
cyl_mse, cyl_var = train_and_test(["cylinders"])
weight_mse, weight_var = train_and_test(["weight"])
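
As a quick follow-up (not part of the required answer), printing the results lets you compare the two univariate models directly:

print("cylinders -> MSE:", cyl_mse, "variance:", cyl_var)
print("weight    -> MSE:", weight_mse, "variance:", weight_var)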

4: Multivariate Models

Now that we have a function for training a regression model and calculating the mean squared error and variance, let's use it to train and understand more complex models.

Instructions

Use the train_and_test function to train linear regression models using the following columns as the features:

  • columns: cylinders, displacement.
    • MSE: two_mse, variance: two_var.
  • columns: cylinders, displacement, horsepower.
    • MSE: three_mse, variance: three_var.
  • columns: cylinders, displacement, horsepower, weight.
    • MSE: four_mse, variance: four_var.
  • columns: cylinders, displacement, horsepower, weight, acceleration.
    • MSE: five_mse, variance: five_var.
  • columns: cylinders, displacement, horsepower, weight, acceleration, model year.
    • MSE: six_mse, variance: six_var.
  • columns: cylinders, displacement, horsepower, weight, acceleration, model year, origin.
    • MSE: seven_mse, variance: seven_var.

Use print statements or the variable inspector below to display each value.

# Our implementation for train_and_test, takes in a list of strings.
def train_and_test(cols):
    # Split into features & target.
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    # Fit model.
    lr = LinearRegression()
    lr.fit(features, target)
    # Make predictions on training set.
    predictions = lr.predict(features)
    # Compute MSE and Variance.
    mse = mean_squared_error(filtered_cars["mpg"], predictions)
    variance = np.var(predictions)
    return(mse, variance)

one_mse, one_var = train_and_test(["cylinders"])
two_mse, two_var = train_and_test(["cylinders", "displacement"])
three_mse, three_var = train_and_test(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_test(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
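
Since the instructions ask you to display each value, here's one simple way to do it (a sketch using the variables defined above):

results = [("two", two_mse, two_var), ("three", three_mse, three_var),
           ("four", four_mse, four_var), ("five", five_mse, five_var),
           ("six", six_mse, six_var), ("seven", seven_mse, seven_var)]
for name, mse, var in results:
    # MSE should fall and variance should rise as features are added.
    print(name, "features -> MSE:", round(mse, 2), "variance:", round(var, 2))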

5: Cross Validation

The multivariate regression models you trained got progressively better at reducing the amount of error.

A good way to detect if your model is overfitting is to compare the in-sample error with the out-of-sample error, or the training error with the test error. So far, we calculated the in-sample error by testing the model on the same data it was trained on. To calculate the out-of-sample error, we need to test the model on a separate set of data. We unfortunately don't have a separate test dataset, so we'll use cross-validation instead.

If a model's cross-validation error (out-of-sample error) is much higher than its in-sample error, then your data science senses should start to tingle. This is the first line of defense against overfitting and is a clear indicator that the trained model doesn't generalize well outside of the training set.
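
Before building the full cross-validation function, here's a minimal sketch of the in-sample vs. out-of-sample comparison using a single holdout split (train_test_split comes from scikit-learn's model_selection module; the two features chosen here are just an example):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

features = filtered_cars[["cylinders", "weight"]]
target = filtered_cars["mpg"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=3)

lr = LinearRegression()
lr.fit(X_train, y_train)

# In-sample error: predictions on the data the model was trained on.
in_sample_mse = mean_squared_error(y_train, lr.predict(X_train))
# Out-of-sample error: predictions on held-out data.
out_of_sample_mse = mean_squared_error(y_test, lr.predict(X_test))
print(in_sample_mse, out_of_sample_mse)

Cross-validation generalizes this idea by rotating the holdout set so that every observation is used for testing exactly once.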

Let's create a new function to handle performing the cross validation and computing the cross validation error.

Instructions

Create a function named train_and_cross_val that:

  • takes in a single parameter (a list of column names),
  • trains a linear regression model using the features specified in the parameter,
  • uses the KFold class to perform 10-fold validation using a random seed of 3 (we use this seed to answer check your code),
  • calculates the overall mean squared error across all folds and the overall mean variance across all folds,
  • returns the overall mean squared error value then the overall variance (e.g. return(avg_mse, avg_var)).

Use the train_and_cross_val function to train linear regression models using the following columns as the features:

  • the cylinders and displacement columns. Assign the resulting mean squared error value to two_mse and the resulting variance value to two_var.
  • the cylinders, displacement, and horsepower columns. Assign the resulting mean squared error value to three_mse and the resulting variance value to three_var.
  • the cylinders, displacement, horsepower, and weight columns. Assign the resulting mean squared error value to four_mse and the resulting variance value to four_var.
  • the cylinders, displacement, horsepower, weight, and acceleration columns. Assign the resulting mean squared error value to five_mse and the resulting variance value to five_var.
  • the cylinders, displacement, horsepower, weight, acceleration, and model year columns. Assign the resulting mean squared error value to six_mse and the resulting variance value to six_var.
  • the cylinders, displacement, horsepower, weight, acceleration, model year, and origin columns. Assign the resulting mean squared error value to seven_mse and the resulting variance value to seven_var.

Use the variable inspector to display each value.

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np
def train_and_cross_val(cols):
    features = filtered_cars[cols]
    target = filtered_cars["mpg"]
    
    variance_values = []
    mse_values = []
    
    # KFold instance (modern scikit-learn API: n_splits replaces the old n/n_folds arguments).
    kf = KFold(n_splits=10, shuffle=True, random_state=3)
    
    # Iterate over each fold.
    for train_index, test_index in kf.split(features):
        # Training and test sets.
        X_train, X_test = features.iloc[train_index], features.iloc[test_index]
        y_train, y_test = target.iloc[train_index], target.iloc[test_index]
        
        # Fit the model and make predictions.
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        
        # Calculate mse and variance values for this fold.
        mse = mean_squared_error(y_test, predictions)
        var = np.var(predictions)

        # Append to lists so we can calculate the overall average mse and variance values.
        variance_values.append(var)
        mse_values.append(mse)
   
    # Compute average mse and variance values.
    avg_mse = np.mean(mse_values)
    avg_var = np.mean(variance_values)
    return(avg_mse, avg_var)
        
two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
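
To connect this back to the overfitting check described above, you can compare the in-sample error from the previous step against the cross-validation error for the same feature set (a sketch assuming both train_and_test and train_and_cross_val are defined in the environment):

cols = ["cylinders", "displacement", "horsepower", "weight",
        "acceleration", "model year", "origin"]
in_mse, _ = train_and_test(cols)       # trained and evaluated on the same data
cv_mse, _ = train_and_cross_val(cols)  # evaluated on held-out folds
print("in-sample MSE:", round(in_mse, 2), "| cross-val MSE:", round(cv_mse, 2))

A large gap between the two values would be the warning sign; here the gap turns out to be modest, which the conclusion below discusses.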

6: Plotting Cross-Validation Error Vs. Cross-Validation Variance

During cross validation, the more features we added to the model, the lower the mean squared error got. This is a good sign and indicates that the model generalizes well to new data it wasn't trained on. As the mean squared error went down, however, the variance of the predictions went up. This is to be expected, since the models with lower squared error values had higher complexity, which makes them more sensitive to small variations in the input values (i.e. higher variance).

For each model, let's plot the error and variance to get a better idea of the tradeoff as the number of features increased.

Instructions

  • On the same Axes instance:

    • Generate a scatter plot with the model's number of features on the x-axis and the model's overall, cross-validation mean squared error on the y-axis. Use red for the scatter dot color.
    • Generate a scatter plot with the model's number of features on the x-axis and the model's overall, cross-validation variance on the y-axis. Use blue for the scatter dot color.
  • Use plt.show() to display the scatter plot.

# We've hidden the `train_and_cross_val` function to save space but you can still call the function!
import matplotlib.pyplot as plt 
%matplotlib inline
        
two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin"])
plt.scatter([2,3,4,5,6,7],[two_mse,three_mse,four_mse,five_mse,six_mse,seven_mse],c="red")
plt.scatter([2,3,4,5,6,7],[two_var, three_var, four_var, five_var, six_var, seven_var],c="blue")
plt.show()

[Plot: cross-validation mean squared error (red) and prediction variance (blue) versus number of features]

7: Conclusion

While the higher order multivariate models overfit in relation to the lower order multivariate models, the in-sample error and the out-of-sample error didn't deviate by much. The best model was around 50% more accurate than the simplest model. On the other hand, the overall variance increased around 25% as we increased the model complexity. This is a really good starting point, but your work is not done! The increased variance that comes with increased model complexity means your model will have more unpredictable performance on truly new, unseen data.

If you were working on this problem on a data science team, you'd need to confirm the predictive accuracy of the model using completely new, unobserved data (e.g. data on cars from later model years). Since you often can't wait until a model is deployed in the wild to find out how well it works, the kind of exploration we did in this mission helps you approximate a model's real-world performance.

8: Next Steps

In this mission, we explored overfitting at a deeper level and introduced related terminology that you'll see in other literature as well. So far, we've mostly dealt with supervised machine learning models to solve regression and classification problems. In the next mission, we'll explore an unsupervised machine learning technique called k-means clustering.
