Guided Project: Predicting bike rentals
https://github.com/dataquestio/solutions/blob/master/Mission213Solution.ipynb
1: The Dataset
In many American cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.
Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which you'll be working with in this project. The file contains 17380 rows, and each row represents the bike rentals in a single hour of a single day. The data can be downloaded here. If you need help at any point, you can consult our solution notebook here.
Here's what the first 5 rows look like:
Here are explanations of the relevant columns:
- instant -- a unique sequential id number for each row.
- dteday -- the date the rentals occurred on.
- season -- the season the rentals occurred in.
- yr -- the year the rentals occurred in.
- mnth -- the month the rentals occurred in.
- hr -- the hour the rentals occurred in.
- holiday -- whether or not the day was a holiday.
- weekday -- the day of the week, encoded as a number.
- workingday -- whether or not the day was a working day.
- weathersit -- the weather situation (categorical variable).
- temp -- the temperature on a 0-1 scale.
- atemp -- the adjusted temperature.
- hum -- the humidity on a 0-1 scale.
- windspeed -- the wind speed on a 0-1 scale.
- casual -- the number of casual riders (people who hadn't previously signed up with the bike sharing program) that rented bikes.
- registered -- the number of registered riders (people who signed up previously) that rented bikes.
- cnt -- the total number of bikes rented (casual + registered).
In this project, you'll try to predict the total number of bikes rented in a given hour. You'll predict the cnt column using all the other columns, except casual and registered. To do this, you'll create a few different machine learning models and evaluate their performance.
Instructions
- Use the Pandas library to read bike_rental_hour.csv into the DataFrame bike_rentals.
- Print out the first few rows of bike_rentals and take a look at the data.
- Make a histogram of the cnt column of bike_rentals, and take a look at the distribution of total rentals.
- Use the corr method on the bike_rentals DataFrame to explore how each column is correlated with cnt.
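The exploration steps above could be sketched roughly as follows. This is only one way to do it: the helper name correlations_with_cnt and the CSV_PATH constant are illustrative, the file is assumed to sit in the working directory, and the histogram call assumes matplotlib is installed.

```python
from pathlib import Path

import pandas as pd

# Assumption: the CSV has been downloaded next to the notebook or script.
CSV_PATH = Path("bike_rental_hour.csv")


def correlations_with_cnt(bike_rentals: pd.DataFrame) -> pd.Series:
    """Correlation of every numeric column with cnt, strongest first."""
    # Non-numeric columns such as dteday are skipped.
    numeric = bike_rentals.select_dtypes("number")
    corr = numeric.corr()["cnt"].drop("cnt")
    # Order by absolute strength but keep the sign of each correlation.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


if CSV_PATH.exists():
    bike_rentals = pd.read_csv(CSV_PATH)
    print(bike_rentals.head())
    # Histogram of total rentals; in a notebook the plot renders inline.
    bike_rentals["cnt"].hist(bins=30)
    print(correlations_with_cnt(bike_rentals))
```

Sorting by absolute value makes strongly negative correlations as visible as strongly positive ones. Expect casual and registered near the top, since cnt is their sum.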
2: Calculating Features
It can often be helpful to calculate features before applying machine learning models. Features can enhance the accuracy of models by introducing new information, or distilling existing information.
For example, the hr column in bike_rentals contains the hours during which bikes were rented, from 0 to 23. A machine learning model will treat each hour as a separate value, and won't understand that certain hours are related. We can introduce some order into this by creating a new column with labels for morning, afternoon, evening, and night. This will bundle similar times together, and enable the model to make better decisions.
Instructions
- Write a function called assign_label that takes in a numeric hour value, and returns:
  - 1 if the hour is from 6 to 12.
  - 2 if the hour is from 12 to 18.
  - 3 if the hour is from 18 to 24.
  - 4 if the hour is from 0 to 6.
- Use the apply method on Series to apply the function to each item in the hr column.
- Assign the result to the time_label column of bike_rentals.
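A minimal sketch of assign_label is shown here. The instructions leave the boundary hours ambiguous (12 appears in two ranges), so this version assumes each range includes its lower bound and excludes its upper bound. The small demo frame stands in for bike_rentals and is not real data.

```python
import pandas as pd


def assign_label(hour: int) -> int:
    """Map an hour of the day to a coarse time-of-day label."""
    # Assumption: ranges are half-open, so hour 12 counts as afternoon.
    if 6 <= hour < 12:
        return 1  # morning
    if 12 <= hour < 18:
        return 2  # afternoon
    if 18 <= hour <= 24:
        return 3  # evening
    if 0 <= hour < 6:
        return 4  # night
    raise ValueError(f"hour out of range: {hour}")


# Stand-in for bike_rentals with a handful of made-up hr values.
demo = pd.DataFrame({"hr": [0, 5, 6, 11, 12, 17, 18, 23]})
demo["time_label"] = demo["hr"].apply(assign_label)
print(demo)
```

On the real data, the same apply call on bike_rentals["hr"] produces the time_label column.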
3: Train/Test Split
Before you can start applying machine learning algorithms, you'll need to split the data into training and testing sets. This will enable you to train an algorithm using the training set and evaluate its accuracy on the testing set. If you train an algorithm on the training data, and evaluate its performance on the same data, you can get an unrealistically low error value, due to overfitting.
Instructions
- Based on your explorations of the cnt column, pick an error metric you want to use to evaluate the performance of the machine learning algorithms. Write up a markdown cell explaining why you picked this metric.
- Select 80% of the rows in bike_rentals to be part of the training set using the sample method on bike_rentals. Assign the result to train.
- Select the rows that are in bike_rentals but not in train to be in the testing set. Assign the result to test.
  - This line will generate a Boolean array that is False when a row in bike_rentals is not found in train: bike_rentals.index.isin(train.index)
  - This line will select any rows in bike_rentals not found in train to be in the testing set: bike_rentals.loc[~bike_rentals.index.isin(train.index)]
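The split described above might look like the sketch below. The function name split_train_test and the fixed random seed are illustrative choices rather than part of the instructions. The seed is there only to make the split reproducible.

```python
import pandas as pd


def split_train_test(bike_rentals: pd.DataFrame, frac: float = 0.8, seed: int = 1):
    """Randomly sample a training set; the remaining rows form the test set."""
    train = bike_rentals.sample(frac=frac, random_state=seed)
    # Keep every row whose index label was not drawn into train.
    test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
    return train, test


# Tiny made-up frame, only to show that the two sets do not overlap.
demo = pd.DataFrame({"cnt": range(10)})
train, test = split_train_test(demo)
print(len(train), len(test))
```

This approach relies on the index labels being unique. If the DataFrame has duplicate index values, rows sharing a label with a training row would be dropped from the test set as well.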
4: Applying Linear Regression
Now that you've done some data exploration and manipulation, you're ready to apply linear regression to the data. Linear regression will likely work fairly well on this data, given that many of the columns are highly correlated with cnt.
As you learned in earlier missions, linear regression works best when predictors are linearly correlated with the target and are independent of one another, so that they don't change meaning when combined with each other. The good thing about linear regression is that it is fairly resistant to overfitting because it is simple. The flip side is that it can be prone to underfitting the data and not building a powerful enough model, which means that linear regression usually isn't the most accurate option.
You'll need to ignore the casual and registered columns because cnt is derived from these columns. If you're trying to predict the number of people who rent bikes in a given hour (cnt), it doesn't make sense that you'd already know casual or registered, because those numbers are added together to get cnt.
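One possible way to fit and score the linear model is sketched below. It assumes mean squared error as the error metric, which is only one reasonable choice. It also drops dteday, because the date is stored as a string and scikit-learn needs numeric inputs. The names EXCLUDED, predictor_columns, and linear_regression_mse are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# cnt is the target; casual and registered leak the target; dteday is a string.
EXCLUDED = ["cnt", "casual", "registered", "dteday"]


def predictor_columns(df: pd.DataFrame) -> list:
    """Every column that is safe to use as a predictor of cnt."""
    return [col for col in df.columns if col not in EXCLUDED]


def linear_regression_mse(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fit on train and return the mean squared error on test."""
    predictors = predictor_columns(train)
    model = LinearRegression()
    model.fit(train[predictors], train["cnt"])
    predictions = model.predict(test[predictors])
    return mean_squared_error(test["cnt"], predictions)
```

Calling linear_regression_mse(train, test) with the train and test sets from the previous step gives a baseline error to compare the later models against.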
5: Applying Decision Trees
You're now ready to apply the decision tree algorithm. You'll be able to compare the error with the error from linear regression, which will enable you to pick the right algorithm for this dataset.
Decision trees tend to predict outcomes on data like this more accurately than linear regression, because they can model nonlinear relationships between the predictors and the target. Because decision trees are a fairly complex model, they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tuned. Decision trees are also prone to instability -- small changes in the input data can result in a very different output model.
Instructions
- Use the DecisionTreeRegressor class to fit a decision tree algorithm to the train data.
- Make predictions using the DecisionTreeRegressor class on test.
- Calculate the error between the predictions and the actual values.
- Experiment with various parameters of the DecisionTreeRegressor class, including min_samples_leaf, to see if it changes error.
- Write a markdown cell with your thoughts on the predictions and the error.
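The experiment with min_samples_leaf could be organized as in the sketch below. The function name, the candidate leaf sizes, and the use of mean squared error are all assumptions made for illustration. The predictors argument is the list of column names to train on.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def tree_mse_by_leaf_size(
    train: pd.DataFrame,
    test: pd.DataFrame,
    predictors: list,
    leaf_sizes=(1, 2, 5, 10),
) -> dict:
    """Fit one tree per min_samples_leaf value and record its test error."""
    results = {}
    for leaf in leaf_sizes:
        # random_state fixes tie-breaking so repeated runs are comparable.
        model = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=1)
        model.fit(train[predictors], train["cnt"])
        predictions = model.predict(test[predictors])
        results[leaf] = mean_squared_error(test["cnt"], predictions)
    return results
```

Larger leaf sizes force the tree to average over more rows per leaf, which usually reduces overfitting at the cost of some detail. Comparing the returned errors shows where that trade-off lands for this data.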
6: Applying Random Forests
You can now apply the random forest algorithm, which improves on the decision tree algorithm. Random forests tend to be much more accurate than simple models like linear regression. Because of how random forests are constructed, they tend to overfit much less than decision trees. Random forests can still be prone to overfitting, though, and tuning parameters such as maximum depth and minimum samples per leaf is important.
Instructions
- Use the RandomForestRegressor class to fit a random forest algorithm to the train data.
- Make predictions using the RandomForestRegressor class on test.
- Calculate the error between the predictions and the actual values.
- Experiment with various parameters of the RandomForestRegressor class, including min_samples_leaf, to see if it changes error.
- Write a markdown cell with your thoughts on the predictions and the error.
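A rough sketch of the random forest step follows. Returning the training error alongside the test error is an addition made here for illustration, since a large gap between the two is a simple sign of overfitting. The function name and default parameter values are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def forest_train_test_mse(
    train: pd.DataFrame,
    test: pd.DataFrame,
    predictors: list,
    min_samples_leaf: int = 1,
    n_estimators: int = 100,
):
    """Fit a random forest and return (train MSE, test MSE)."""
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        min_samples_leaf=min_samples_leaf,
        random_state=1,  # fixed so that runs are repeatable
    )
    model.fit(train[predictors], train["cnt"])
    train_mse = mean_squared_error(train["cnt"], model.predict(train[predictors]))
    test_mse = mean_squared_error(test["cnt"], model.predict(test[predictors]))
    return train_mse, test_mse
```

Running this for a few min_samples_leaf values and watching how the gap between the two errors changes gives material for the markdown write-up.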
7: Next Steps
That's it for the guided steps! We recommend exploring the data more on your own.
Here are some potential next steps:
- Calculate more features, such as:
- An index combining temperature, humidity, and wind speed.
- Try predicting casual and registered instead of cnt.
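For the second suggestion, one option is to fit a separate model per rider type, as in the sketch below. The choice of a random forest, its parameter values, the function name, and the cnt_from_parts column are all assumptions. Summing the two predictions is shown only as a way to compare against a direct cnt model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def predict_rider_types(
    train: pd.DataFrame, test: pd.DataFrame, predictors: list
) -> pd.DataFrame:
    """Predict casual and registered separately, then add them up."""
    predictions = pd.DataFrame(index=test.index)
    for target in ("casual", "registered"):
        # One independent model per rider type.
        model = RandomForestRegressor(
            n_estimators=50, min_samples_leaf=2, random_state=1
        )
        model.fit(train[predictors], train[target])
        predictions[target] = model.predict(test[predictors])
    # Hypothetical derived column: total rentals implied by the two models.
    predictions["cnt_from_parts"] = predictions["casual"] + predictions["registered"]
    return predictions
```

The predictors list should still exclude cnt, casual, and registered, for the same reason as before.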
We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.
You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.
We hope this guided project has been a good experience, and please email us at [email protected] if you want to share your work!
Reposted from: https://my.oschina.net/Bettyty/blog/752991