Applying decision trees

1: The Dataset

In the past two missions, we learned how decision trees are constructed. We used a modified version of ID3, which is a bit simpler than the most common tree-building algorithms, C4.5 and CART. The basics are the same, though, so we can apply the principles we learned about how decision trees work to any tree-construction algorithm.

In this mission, we'll learn about when to use decision trees, and how to use them most effectively.

We've been using a dataset on US income, which we'll keep using here. The data is from the 1994 Census, and contains information on an individual's marital status, age, type of work, and more. The target column, high_income, indicates whether an individual makes less than or equal to 50k a year (0) or more than 50k a year (1).

You can download the data from here.
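
If you're working along locally, here's roughly how the dataframe could be loaded and prepared. This is a minimal sketch, not the mission's setup code: the file name income.csv is an assumption, and in the mission the income dataframe is provided with its text columns already converted to numbers.

import pandas

# Read in the census data (the file name here is an assumption).
income = pandas.read_csv("income.csv")

# Convert each text column to numeric category codes so scikit-learn can use it.
text_columns = ["workclass", "marital_status", "occupation", "relationship",
                "race", "sex", "native_country", "high_income"]
for col in text_columns:
    income[col] = pandas.Categorical(income[col]).codes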

2: Using Decision Trees With Scikit-Learn

We can use the scikit-learn package to fit a decision tree. The interface is very similar to other algorithms we've fit in the past.

We use the DecisionTreeClassifier class for classification problems, and DecisionTreeRegressor for regression problems. Both of these classes are in the sklearn.tree package.

In this case, we're predicting a binary outcome, so we'll use a classifier.

The first step is to train the classifier on the data. We'll use the fit method on a classifier to do this.

Instructions

Fit clf to the income data.

  • Pass in income[columns] to only use the named columns as predictors.
  • The target is the high_income column.

from sklearn.tree import DecisionTreeClassifier

# A list of columns to train with.
# All columns have been converted to numeric.
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

# Instantiate the classifier.
# Set random_state to 1 to keep results consistent.
clf = DecisionTreeClassifier(random_state=1)

# The variable income is loaded, and contains all the income data.
clf.fit(income[columns], income["high_income"])

3: Splitting The Data Into Train And Test Sets

Now that we've fit a model, we can make predictions. We'll want to split our data into training and testing sets first. If we don't, we'll be making predictions on the same data that we trained our algorithm with. This hides overfitting, and makes our error appear lower than it really is.

We covered overfitting in more depth earlier, but a simple explanation is that if you memorize how to perform 3 specific addition problems (2+2, 3+6, 3+3), you'll get those specific problems correct every time. On the other hand, if you're asked 4+4, you won't know how to do it, because you don't know the rules of addition.

If you learn the rules of addition, you'll sometimes get problems wrong (3443343434+24344343 can be hard to do mentally), but you'll be able to do any problem, and you'll get most of them right. Overfitting is the first example, where you memorize the details of the training set, but are unable to generalize to new examples that you're asked to make predictions on.

We can detect overfitting by always making predictions and evaluating error on data that our algorithm hasn't been trained with. This gives us a realistic estimate of error on data the algorithm hasn't seen before.

We can split the data by shuffling the order of the dataframe, then selecting certain rows to be in the training set, and certain rows to be in the testing set.

In this case, we'll make 80% of our rows training data, and the rest testing data.

Instructions

All the rows in income with a position up to train_max_row (but not including it) will be part of the training set.

  • Make a new dataframe called train containing all of these rows.
  • Make a dataframe called test containing all of the rows with a position greater than or equal to train_max_row.

import numpy
import math

# Set a random seed so the shuffle is the same every time.
numpy.random.seed(1)

# Shuffle the rows.  This first permutes the index randomly using numpy.random.permutation.
# Then, it reindexes the dataframe with this.
# The net effect is to put the rows into random order.
income = income.reindex(numpy.random.permutation(income.index))

train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]
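
For reference, scikit-learn can do the same shuffle-and-split in one call. This is an optional alternative to the manual approach above, assuming a reasonably recent version of scikit-learn with the model_selection module:

from sklearn.model_selection import train_test_split

# Shuffle the rows and hold out 20% of them as a test set.
# random_state keeps the split reproducible across runs.
train, test = train_test_split(income, test_size=0.2, random_state=1)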

4: Evaluating Error

While there are many methods to evaluate error with classification, we'll use AUC, which we covered extensively earlier in the machine learning material. AUC ranges from 0 to 1, and is ideal for binary classification. The higher the AUC, the more accurate our predictions.

We can compute AUC with the roc_auc_score function from sklearn.metrics. This function takes in 2 parameters:

  • y_true: the true labels
  • y_score: the predicted scores (here, we'll pass in the predicted labels)

and returns the computed AUC value.

Instructions

  • Compute the AUC between predictions and the high_income column of test, and assign the result to error.
  • Use the print function to display error.

from sklearn.metrics import roc_auc_score

clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
error = roc_auc_score(test["high_income"], predictions)
print(error)
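
A quick aside: we're passing hard 0/1 predictions as y_score here, which works for a binary problem but throws away information. AUC is more commonly computed from predicted probabilities, which the classifier exposes through predict_proba. A small variant using the same clf and test set:

# The probability of the positive class is the second column of predict_proba's output.
probabilities = clf.predict_proba(test[columns])[:, 1]
print(roc_auc_score(test["high_income"], probabilities))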

5: Compute Error On The Training Set

The AUC for the predictions on the testing set is about .694. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting.

It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.

Instructions

  • Print out the AUC score between predictions and the high_income column of train.

predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))

6: Decision Tree Overfitting

Our AUC on the training set was .947. The AUC on the test set was .694. There's no hard and fast rule on when overfitting is happening, but our model is predicting the training set much better than it's predicting the test set. Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect it and fix it.

Based on our AUC measurements, it appears that we are in fact overfitting. Let's look a little more into why decision trees might overfit.

In the last mission, we looked at this data:

high_income    age    marital_status
0              20     0
0              60     2
0              40     1
1              25     1
1              35     2
1              55     1

Here's the full diagram for the decision tree we can build from the above data:

Full tree:

Age above 37.5?
  No:  Age above 25?
         No:  Age above 22.5?
                No:  Leaf(0)
                Yes: Leaf(1)
         Yes: Leaf(1)
  Yes: Age above 55?
         No:  Age above 47.5?
                No:  Leaf(0)
                Yes: Leaf(1)
         Yes: Leaf(0)

This tree perfectly predicts all of our values. It can always get a right answer on the training set. This is the equivalent of memorizing the answers to specific addition problems rather than learning the rules of addition. We've built our tree in such a way that it can perfectly predict the training set -- but the way the tree has been constructed doesn't make sense when we step back.

The tree above is saying that if you're under 22.5 years old, you have low income; if you're between 22.5 and 37.5, high income; between 37.5 and 47.5, low income; between 47.5 and 55, high income; and above 55, low income. These rules are very specific to the training set.

Think about the problem with a real-world lens. Does it make sense to predict that someone who is 20 is low income, someone who is 25 is high income, and someone who is 40 is low income? Intuitively, we know that people who are younger probably make less, people who are middle-aged make more, and people who have retired make less.

Our tree has created so many age-based splits in an attempt to perfectly predict everyone's income that each split is effectively meaningless.

Here's a tree that matches up with our intuition better:

Smaller tree:

Age above 37.5?
  No:  Age above 25?
         No:  Leaf(0)
         Yes: Leaf(1)
  Yes: Age above 55?
         No:  Leaf(.66)
         Yes: Leaf(0)

All we've done is "prune" the tree by removing some of the lower leaves and turning some of the higher-up nodes into leaves instead.

The tree above makes more intuitive sense. If you're under 25, we predict low income. If you're between 25 and 55, we predict high income (the .66 rounds up to 1). If you're above 55, we predict low income.

This actually has lower accuracy on our training set, but it will generalize better to new examples, because it matches reality better.

Trees overfit when they have too much depth, and make overly complex rules that match the training data, but aren't able to generalize well to new data.

This may seem like a strange principle at first, but the deeper a tree is, the worse it typically performs on new data.

7: Building A Shallower Tree

There are three main ways to combat overfitting:

  • "Prune" the tree after building to remove unneeded leaves.
  • Use ensembling to blend the predictions of many trees.
  • Restrict the depth of the tree while you're building it.

We'll explore all of these, but we'll look at the third method first.

By controlling how deep the tree can go while we build it, we keep the rules more general than they would be otherwise. This prevents the tree from overfitting.

We can restrict how deep the tree is built with a few parameters when we initialize the DecisionTreeClassifier class:

  • max_depth -- this globally restricts how deep the tree can go.
  • min_samples_split -- the minimum number of rows a node needs before it can be split. For example, if this is set to 3, then nodes with 2 rows won't be split, and will become leaves instead.
  • min_samples_leaf -- the minimum number of rows that a leaf must have.
  • min_weight_fraction_leaf -- the fraction of input rows that are required to be at a leaf.
  • max_leaf_nodes -- the maximum number of total leaves. This will cap the count of leaf nodes as the tree is being built.

As you can see, some of these parameters overlap. For example, there's little point in setting both max_depth and max_leaf_nodes: in older versions of scikit-learn, max_depth is simply ignored whenever max_leaf_nodes is set.
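
To get a feel for how these restrictions behave, you can sweep one of them and watch the gap between train and test AUC change. Here's a rough sketch that reuses the train, test, and columns variables from earlier (the exact numbers will depend on your split):

# Compare train and test AUC as the allowed depth grows.
for depth in [2, 4, 6, 8, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(train[columns], train["high_income"])
    train_auc = roc_auc_score(train["high_income"], model.predict(train[columns]))
    test_auc = roc_auc_score(test["high_income"], model.predict(test[columns]))
    print(depth, round(train_auc, 3), round(test_auc, 3))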

Now that we know what to tweak, let's improve our model.

Instructions

  • Set min_samples_split to 13 when creating the DecisionTreeClassifier.
  • Make predictions on the training set, and compute AUC and assign it to train_auc.
  • Make predictions on the test set, and compute AUC and assign it to test_auc.

# The decision tree model from the last screen, for reference:
# clf = DecisionTreeClassifier(random_state=1)

# The same model, but with min_samples_split restricted to 13.
clf = DecisionTreeClassifier(min_samples_split=13, random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

8: More Parameter Tweaking

By setting min_samples_split to 13, we managed to boost test AUC from .694 to .700. Training set AUC decreased from .947 to .843, showing that the model we built was less overfit to the training set than before:

settings                  train AUC    test AUC
default                   0.947        0.694
min_samples_split: 13     0.843        0.700

Let's play around some more with parameters.

Instructions

  • Set max_depth to 7 and min_samples_split to 13 when creating the DecisionTreeClassifier.
  • Make predictions on the training set, and compute AUC and assign it to train_auc.
  • Make predictions on the test set, and compute AUC and assign it to test_auc.

# The first decision tree model we trained and tested.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)
clf = DecisionTreeClassifier(random_state=1, min_samples_split=13, max_depth=7)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

9: Tweaking The Depth

We just improved the AUC again! Test set AUC increased to .744, while the training set AUC decreased to .748:

settings                                          train AUC    test AUC
default (min_samples_split: 2, max_depth: None)   0.947        0.694
min_samples_split: 13                             0.843        0.700
min_samples_split: 13, max_depth: 7               0.748        0.744

We aren't overfitting anymore since both AUC values are about the same. Let's tweak the parameters more aggressively and see what happens!

Instructions

  • Set max_depth to 2 and min_samples_split to 100 when creating the DecisionTreeClassifier.
  • Make predictions on the training set, and compute AUC and assign it to train_auc.
  • Make predictions on the test set, and compute AUC and assign it to test_auc.

# The first decision tree model we trained and tested.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)
clf = DecisionTreeClassifier(random_state=1, min_samples_split=100, max_depth=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

10: Underfitting

Our AUC went down on the last screen relative to the screen before:

settings                                          train AUC    test AUC
default (min_samples_split: 2, max_depth: None)   0.947        0.694
min_samples_split: 13                             0.843        0.700
min_samples_split: 13, max_depth: 7               0.748        0.744
min_samples_split: 100, max_depth: 2              0.662        0.655

This is because we're now underfitting. Underfitting is what happens when our model is too simple to actually explain the relations between the variables.

Let's go back to our tree diagram to explain underfitting.

Here's the data:

 
high_income    age    marital_status
0              20     0
0              60     2
0              40     1
1              25     1
1              35     2
1              55     1

And here's the "right fit" tree. This tree explains the data properly, without overfitting:

"Right fit" tree:

Age above 37.5?
  No:  Age above 25?
         No:  Leaf(0)
         Yes: Leaf(1)
  Yes: Age above 55?
         No:  Leaf(.66)
         Yes: Leaf(0)

Let's trim this tree even more to show what happens when the model isn't complex enough to explain the data:

Underfit tree:

Age above 37.5?
  No:  Leaf(.66)
  Yes: Leaf(.33)

In this model, anybody under 37.5 will be predicted to have high income (.66 rounds up), and anyone over 37.5 will be predicted to have low income (.33 rounds down). This model is too simple to capture reality -- that younger people make less, middle-aged people make more, and older people make less.

Thus, this tree underfits the data and will have lower accuracy than the properly fit version.

11: The Bias-Variance Tradeoff

By artificially restricting the depth of our tree, we prevent it from creating a complex enough model to correctly categorize some of the rows. If we don't perform the artificial restrictions, the tree becomes too complex, and fits quirks in the data that only exist in the training set, but don't generalize to new data.

This is known as the bias-variance tradeoff. If we take random samples of the training data and create many models, and the models' predictions for the same row are far apart from each other, we have high variance. If we do the same and the models' predictions for the same row are close together, but far from the actual value, then we have high bias.

High bias can cause underfitting -- if a model is consistently failing to predict the correct value, it may be that it is too simple to actually model the data.

High variance can cause overfitting -- if a model is very susceptible to small changes in the input data, and changes its predictions massively, then it is likely fitting itself to quirks in the training data, and not making a generalizable model.

It's called the bias-variance tradeoff because decreasing one will usually increase the other. This is a limitation of all machine learning algorithms. If you want to read more about the tradeoff, you can look here.

In general, decision trees suffer from high variance. The whole structure of a decision tree can change if you make a minor alteration to its training data. By restricting the depth of the tree, we increase the bias and decrease the variance. If we restrict the depth too much, we increase bias to the point where it will underfit.

Generally, you'll need to use your intuition and manually tweak parameters to get the "right" fit.
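
One informal way to see the variance side of the tradeoff is to train the same kind of tree on a few different random subsamples of the training data and check how often their predictions for the same test rows disagree. Here's a rough sketch that reuses train, test, and columns; the subsample size and number of models are arbitrary choices:

# Train several trees on different random subsamples of the training data.
subsample_predictions = []
for seed in range(4):
    subsample = train.sample(frac=0.5, random_state=seed)
    model = DecisionTreeClassifier(random_state=1)
    model.fit(subsample[columns], subsample["high_income"])
    subsample_predictions.append(model.predict(test[columns]))

# The fraction of test rows where the trees disagree -- a crude measure of variance.
stacked = numpy.array(subsample_predictions)
disagreement = (stacked.min(axis=0) != stacked.max(axis=0)).mean()
print(disagreement)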

12: Exploring Decision Tree Variance

We can induce variance and see what happens with a decision tree. To add noise to the data, we'll just add a column of random values. A model with high variance (like a decision tree) will pick up on this noise, and overfit to it. This is because models with high variance are very sensitive to small changes in input data.

Instructions

  • Fit the classifier to the training data.

  • Make predictions on the training set, and compute AUC and assign it to train_auc.

  • Make predictions on the test set, and compute AUC and assign it to test_auc.

numpy.random.seed(1)

# Generate a column of random integers from 0 to 3 (randint's upper bound is exclusive).
income["noise"] = numpy.random.randint(4, size=income.shape[0])

# Adjust columns to include the noise column.
columns = ["noise", "age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

# Make new train and test sets.
train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]

# Initialize the classifier.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)

13: Pruning

As you can see above, the random noise column causes significant overfitting. Our test set AUC decreases to .691, and our training set AUC increases to .975.

One way to prevent overfitting that we tried before was to prevent the tree from growing beyond a certain depth. Another technique is called pruning. Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex, and can make a simpler model with higher accuracy on the testing set.

Pruning is less commonly used than parameter optimization (like we just did), and ensembling. That's not to say that it isn't an important technique, and we'll cover it in more depth down the line.
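
For what it's worth, newer scikit-learn releases (0.22 and later) expose minimal cost-complexity pruning directly through the ccp_alpha parameter of DecisionTreeClassifier; the version this mission was written against likely didn't have it. A small sketch, with an arbitrary alpha value:

# Larger ccp_alpha values prune the grown tree more aggressively.
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.001)
pruned.fit(train[columns], train["high_income"])
print(roc_auc_score(test["high_income"], pruned.predict(test[columns])))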

14: When To Use Decision Trees

Let's go over the main advantages and disadvantages of decision trees. The main advantages of decision trees are:

  • Easy to interpret
  • Relatively fast to fit and make predictions
  • Able to handle multiple types of data
  • Can pick up nonlinearities in data, and are usually fairly accurate

The main disadvantage is a tendency to overfit.

In tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing, decision trees are a good choice.

The most powerful way to reduce decision tree overfitting is to create ensembles of trees. A popular algorithm to do this is called random forest. We'll cover random forests in the next mission. In cases where prediction accuracy is the most important consideration, random forests usually perform better.
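
As a rough preview of what that looks like in code, a random forest can be swapped in with almost the same interface as a single tree; the parameter values here are just illustrative:

from sklearn.ensemble import RandomForestClassifier

# An ensemble of trees, each trained on a bootstrap sample with a random subset of features.
forest = RandomForestClassifier(n_estimators=100, random_state=1, min_samples_leaf=2)
forest.fit(train[columns], train["high_income"])
print(roc_auc_score(test["high_income"], forest.predict(test[columns])))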

In the next mission, we'll explore the random forest algorithm in more depth.

Reposted from: https://my.oschina.net/Bettyty/blog/752982