In this project, we will evaluate the predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston. A model trained on this data that is seen as a good fit could then be used to make certain predictions about a home's monetary value.
The Boston house data was collected in 1978 and each of the 506 entries represent aggregated data about 14 features for homes. For the purpose of the project, certain preprocessing steps have been made to the dataset.
# Load Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_validation import ShuffleSplit
import seaborn as sns
%matplotlib inline
# load the data
data = pd.read_csv("../data/housing.csv")
features = data.drop('MEDV',axis=1)
price = data['MEDV']
print "In Boston housing dataset, there are {} data points with {} features.".format(*features.shape)
First, we are going to get descriptive statistic about the data.
price.describe()
sns.distplot(price,bins=20,kde=False)
plt.xlabel('House Price')
plt.ylabel('Count')
plt.show()
# examine the relationship among features and target
plt.figure(figsize=[10,20])
sns.pairplot(data)
plt.show()
We could see that RM and LSTAT are good indicator of the house price. We could anticapte that from the decision tree model the important features are these two features.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features, price, test_size = 0.1, random_state = 0)
It is difficult to measure the quality of a given model without quantifying its performance over training and testing data. For this project we will calculate the coefficient of determination, $R^2$, to quantify the model's performance. The value of $R^2$ range from 0 to 1. A model with $R^2$ as 0 is no better than a model that always predict the mean of the target variable, whereas a model with $R^2$ as 1 perfectly predict the target variable. $R^2$ of a model could be negative, which means that the model is abitrarily worese than the model always predict the mean.
from sklearn.metrics import r2_score
def performance_metric(y_true, y_pred):
return r2_score(y_true, y_pred)
Here we explore to use the decision tree regressor. And "max_depth" parameter is the criteria to investigate the model performance. We plot the learning curve of the model for both training and testing data.
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
cv_sets = ShuffleSplit(X_train.shape[0], n_iter = 10, test_size = 0.2, random_state=1)
train_sizes = np.rint(np.linspace(1, 440, 8))/440
plt.figure(figsize=[10,10])
plt.subplots_adjust(hspace=0.3)
for k, depth in enumerate([1,4,7,10]):
regressor = DecisionTreeRegressor(max_depth=depth)
sizes, train_score, test_score = learning_curve(regressor, X_train, Y_train, train_sizes = train_sizes,
cv = cv_sets, scoring = 'r2')
plt.subplot(2,2,k+1)
plt.plot(sizes,np.mean(train_score,axis=1),marker = 'o', color = 'g', label='Training Score')
plt.plot(sizes,np.mean(test_score,axis=1), marker = 'o', color = 'r', label = 'Testing Score')
plt.fill_between(sizes, np.mean(train_score,axis=1)-np.std(train_score,axis=1),
np.mean(train_score,axis=1)+np.std(train_score,axis=1), color = 'g')
plt.fill_between(sizes, np.mean(test_score,axis=1)-np.std(test_score,axis=1),
np.mean(test_score,axis=1)+np.std(test_score,axis=1), color = 'r',alpha = 0.4)
plt.xlabel("Samples")
plt.ylabel("R2")
plt.xlim([1,350])
plt.title("max_depth {}".format(depth))
plt.legend(bbox_to_anchor=(1.5,2.2),loc = 1, fontsize =12)
plt.suptitle('Decision Tree Regressor Performance', fontsize=12)
plt.show()
Next, we investigate the complexity curve along with "max_depth" parameter.
from sklearn.model_selection import validation_curve
max_depths = np.arange(1,11)
train_val_scores, test_val_scores = validation_curve(regressor, X_train, Y_train, scoring = 'r2',
param_name='max_depth',param_range=max_depths, cv = cv_sets)
plt.figure(figsize=[6,6])
plt.plot(max_depths,np.mean(train_val_scores,axis=1), marker='o', color = 'g', label='Training Error')
plt.plot(max_depths,np.mean(test_val_scores,axis=1), marker='o', color = 'r', label = 'Validation Error')
plt.fill_between(max_depths,np.mean(train_val_scores,axis=1)-np.std(train_val_scores,axis=1),
np.mean(train_val_scores,axis=1)+np.std(train_val_scores,axis=1), color = 'g', alpha = 0.3)
plt.fill_between(max_depths,np.mean(test_val_scores,axis=1)-np.std(test_val_scores,axis=1),
np.mean(test_val_scores,axis=1)+np.std(test_val_scores,axis=1), color = 'r', alpha = 0.3)
plt.xlabel('max_depth')
plt.ylabel('R2')
plt.legend(loc= 2,fontsize=14)
plt.show()
From the validation curve, we could anticipate that the model with max_depth as 3 has the best performance.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
cv_sets = ShuffleSplit(X_train.shape[0], n_iter=10, test_size= 0.2, random_state = 0)
params = {'max_depth':np.arange(1, 11)}
regressor = DecisionTreeRegressor()
score_fun = make_scorer(performance_metric)
rbm = GridSearchCV(regressor,param_grid=params, scoring= score_fun, cv=cv_sets)
grid = rbm.fit(X_train, Y_train)
print "The max_depth is {} for the optimal model".format(grid.best_params_['max_depth'])
opt_reg = grid.best_estimator_
y_pred = opt_reg.predict(X_test)
p_metric = performance_metric(Y_test, y_pred)
p_metric