Predicting Boston House Price

In this project, we will evaluate the predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston. A model trained on this data that is seen as a good fit could then be used to make certain predictions about a home's monetary value.

The Boston house data was collected in 1978 and each of the 506 entries represent aggregated data about 14 features for homes. For the purpose of the project, certain preprocessing steps have been made to the dataset.

In [1]:
# Load Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_validation import ShuffleSplit
import seaborn as sns

%matplotlib inline

# load the data
data = pd.read_csv("../data/housing.csv")
features = data.drop('MEDV',axis=1)
price = data['MEDV']

print "In Boston housing dataset, there are {} data points with {} features.".format(*features.shape)
/home/jinux/anaconda2/envs/venv/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In Boston housing dataset, there are 489 data points with 3 features.

Data Exploration

First, we are going to get descriptive statistic about the data.

  • 'RM' is the average number of rooms among homes in the neighborhood.
  • 'LSTAT' is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
  • 'PTRATIO' is the ratio of students to teachers in primary and secondary schools in the neighborhood.
In [2]:
price.describe()
Out[2]:
count    4.890000e+02
mean     4.543429e+05
std      1.653403e+05
min      1.050000e+05
25%      3.507000e+05
50%      4.389000e+05
75%      5.187000e+05
max      1.024800e+06
Name: MEDV, dtype: float64
In [3]:
sns.distplot(price,bins=20,kde=False)
plt.xlabel('House Price')
plt.ylabel('Count')
plt.show()
In [4]:
# examine the relationship among features and target
plt.figure(figsize=[10,20])
sns.pairplot(data)
plt.show()
<matplotlib.figure.Figure at 0x7fb56f7bca90>

We could see that RM and LSTAT are good indicator of the house price. We could anticapte that from the decision tree model the important features are these two features.

Training and Testing data split

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features, price, test_size = 0.1, random_state = 0)

Developing model

It is difficult to measure the quality of a given model without quantifying its performance over training and testing data. For this project we will calculate the coefficient of determination, $R^2$, to quantify the model's performance. The value of $R^2$ range from 0 to 1. A model with $R^2$ as 0 is no better than a model that always predict the mean of the target variable, whereas a model with $R^2$ as 1 perfectly predict the target variable. $R^2$ of a model could be negative, which means that the model is abitrarily worese than the model always predict the mean.

In [6]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_pred):
    return r2_score(y_true, y_pred)

Analyzing model

Here we explore to use the decision tree regressor. And "max_depth" parameter is the criteria to investigate the model performance. We plot the learning curve of the model for both training and testing data.

In [7]:
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

cv_sets = ShuffleSplit(X_train.shape[0], n_iter = 10, test_size = 0.2, random_state=1)
train_sizes = np.rint(np.linspace(1, 440, 8))/440

plt.figure(figsize=[10,10])
plt.subplots_adjust(hspace=0.3)
for k, depth in enumerate([1,4,7,10]):
    regressor = DecisionTreeRegressor(max_depth=depth)
    sizes, train_score, test_score = learning_curve(regressor, X_train, Y_train, train_sizes = train_sizes, 
                                                   cv = cv_sets, scoring = 'r2')
    plt.subplot(2,2,k+1)
    plt.plot(sizes,np.mean(train_score,axis=1),marker = 'o', color = 'g', label='Training Score')
    plt.plot(sizes,np.mean(test_score,axis=1), marker = 'o', color = 'r', label = 'Testing Score')
    plt.fill_between(sizes, np.mean(train_score,axis=1)-np.std(train_score,axis=1),
                    np.mean(train_score,axis=1)+np.std(train_score,axis=1), color = 'g')
    plt.fill_between(sizes, np.mean(test_score,axis=1)-np.std(test_score,axis=1),
                    np.mean(test_score,axis=1)+np.std(test_score,axis=1), color = 'r',alpha = 0.4)
    plt.xlabel("Samples")
    plt.ylabel("R2")
    plt.xlim([1,350])
    plt.title("max_depth {}".format(depth))
    
plt.legend(bbox_to_anchor=(1.5,2.2),loc = 1, fontsize =12)
plt.suptitle('Decision Tree Regressor Performance', fontsize=12)
plt.show()

Next, we investigate the complexity curve along with "max_depth" parameter.

In [10]:
from sklearn.model_selection import validation_curve
max_depths = np.arange(1,11)

train_val_scores, test_val_scores = validation_curve(regressor, X_train, Y_train, scoring =  'r2',
                                             param_name='max_depth',param_range=max_depths, cv = cv_sets) 
plt.figure(figsize=[6,6])
plt.plot(max_depths,np.mean(train_val_scores,axis=1), marker='o', color = 'g', label='Training Error')
plt.plot(max_depths,np.mean(test_val_scores,axis=1), marker='o', color = 'r', label = 'Validation Error')
plt.fill_between(max_depths,np.mean(train_val_scores,axis=1)-np.std(train_val_scores,axis=1),
                 np.mean(train_val_scores,axis=1)+np.std(train_val_scores,axis=1), color = 'g', alpha = 0.3)
plt.fill_between(max_depths,np.mean(test_val_scores,axis=1)-np.std(test_val_scores,axis=1),
                 np.mean(test_val_scores,axis=1)+np.std(test_val_scores,axis=1), color = 'r', alpha = 0.3)
plt.xlabel('max_depth')
plt.ylabel('R2')
plt.legend(loc= 2,fontsize=14)
plt.show()

From the validation curve, we could anticipate that the model with max_depth as 3 has the best performance.

Find the best model with Grid Search and Cross Validation

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

cv_sets = ShuffleSplit(X_train.shape[0], n_iter=10, test_size= 0.2, random_state = 0)

params = {'max_depth':np.arange(1, 11)}

regressor = DecisionTreeRegressor()

score_fun = make_scorer(performance_metric)

rbm = GridSearchCV(regressor,param_grid=params, scoring= score_fun, cv=cv_sets)

grid = rbm.fit(X_train, Y_train)
In [28]:
print "The max_depth is {} for the optimal model".format(grid.best_params_['max_depth'])
The max_depth is 4 for the optimal model

Use the optimal model on test data set:

In [30]:
opt_reg = grid.best_estimator_
y_pred = opt_reg.predict(X_test)

p_metric = performance_metric(Y_test, y_pred)
In [31]:
p_metric
Out[31]:
0.83051700594641198