This project is to accurately predict whether an individual makes more than $50,000. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit better understand how large of a donation to request, or whether or not they should reach out to begin with. The dataset for this project originates from the UCI Machine Learning Repository.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('../data/census.csv')
data.head(1)
The first investigation is to determine how many indivials fit into either group. What is the percentage of these people making more than $50,000.
from __future__ import division
n_records = len(data)
n_greater_50k = len(data[data['income']=='>50K'])
n_at_most_50k = len(data[data['income']=='<=50K'])
print "There are {0} individuals with {1:.1f}% making more than $50,000, \
with {2:.1f}% making at most $50,000.".format(n_records, 100*n_greater_50k/n_records, 100*n_at_most_50k/n_records)
Before data can be used as input for machine learning algorithm, it often must be cleaned, formatted and restructured -- this is typically known as preprocessing.
A dataset may sometimes contain at least one feature whose value tend to lie near a single number, but will also have a non-trivial number of vastly larger or smaller values than that single number. Algorithms can be sensitive to such distribution of values and can underperform if the range is not properly normalized. With the census dataset two features fit this description: "capitcal-gain" and "capital-loss".
plt.figure(figsize=[10,4])
plt.subplots_adjust(wspace = 0.4, hspace = 0.2)
plt.subplot(1,2,1)
plt.hist(data['capital-gain'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital gain\' feature distribution')
plt.subplot(1,2,2)
plt.hist(data['capital-loss'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital lost\' feature distribution',)
plt.suptitle('Skewed Distribution of Data features')
plt.show()
For highly-skewed feature distribution such as "capital gain" and "capital lost", it is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values cause by the outliers. (The logarithm of 0 is not defined.)
# split the data into feature space and target
income_raw = data['income']
features_raw = data.drop('income', axis=1)
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x+1))
plt.figure(figsize=[10,4])
plt.subplots_adjust(wspace = 0.4, hspace = 0.2)
plt.subplot(1,2,1)
plt.hist(features_raw['capital-gain'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital gain\' feature distribution')
plt.subplot(1,2,2)
plt.hist(features_raw['capital-loss'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital lost\' feature distribution',)
plt.suptitle('Skewed Distribution of Data features after logarithmic transformation')
plt.show()
In addition to performaing transformation on features that are highly skewed, it is often good practice to perform some type of scaling on numerical feature. Applying a scaling to the data does not change the shape of each feature's distribution. However, normalization ensure that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning. We will sklearn.preprocessing.MinMaxScaler for this;
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numerical = ['age', 'education-num','capital-gain','capital-loss','hours-per-week']
features_raw[numerical] = scaler.fit_transform(features_raw[numerical])
features_raw.head(1)
From the table above, we can see there are several features for each record that are non-numerical. Typically, learning algorithm expect to be numerical, which requires that non-numerical features be converted. One popular way to convert non-numerical variables is by using one-hot-encoding scheme. One-hot encoding creates a "dummy" variable for each possible category of each non-numeric feature.
Additionally, we need to convert the non-numerical target label to numerical value for the learing algorithm to work.
# One hot encoding the feature raw data using panadas.get_dummies()
features = pd.get_dummies(features_raw,columns=['workclass','education_level','marital-status','occupation',
'relationship','race','sex','native-country'])
income = income_raw.apply(lambda x: 1 if x == ">50K" else 0)
# Print the number of features after one-hot encoding
encoded = list(features.columns)
print "{} total features after one-hot encoding.".format(len(encoded))
features.head(1)
As always, we will split the data into training and test set. 80% of the data will be used for training and 20% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features, income, test_size = 0.2, random_state = 10)
print "Training set has {} samples".format(X_train.shape[0])
print "Test set has {} samples".format(X_test.shape[0])
In this section, we will investigate multiple algorithms, and determine which is best at modelling the data.
Identifying someone that does not make more than \$50,000 as someone who does would be detrimental, since we are looking for find individuals willing to donate. Therefore, a model's ability to precisely predict those that make more than $50,000 is mor important than the model's ability to recall those individuals. We can use F-beta score as a metric that considers both precision and recall:
\begin{equation} F_{\beta} = (1+\beta^2)\times \frac{precision\times recall}{(\beta^2 \times precision) + recall} \end{equation}In particular, when $\beta = 0.5$, more emphasis is placed on precision.
If we choose a model that always predict an individual made more than \$50,000, what would tha model's accuracy and F-score be on the dataset.
# Calculate the accuracy
accuracy = n_greater_50k / n_records
# Calculate F-score using the formula above for beta = 0.5
precision = n_greater_50k/n_records
recall = n_greater_50k / n_greater_50k
fscore = (1+0.5**2)*(precision*recall)/(0.5**2*precision+recall)
print "Naive predictor: [Accuracy scoare {:.4f}, F-score: {:.4f}.]".format(accuracy, fscore)
The following supervised learning models are currently available in sciki-learn:
To properly evaluate the performance of each model, it's important that you create a training and predicting pipeline that allows you to quickly and effectively train model using varouse sizes of training data and perform prediction on the testing data.
from sklearn.metrics import fbeta_score, accuracy_score
from time import *
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
results = {}
start = time()
learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
end = time()
results['train_time'] = end-start;
# Get the prediction on the test set then get prediction on the first training sample
start = time()
prediction_test = learner.predict(X_test)
prediction_train= learner.predict(X_train[:300])
end = time()
results['pred_time'] = end - start
results['acc_train'] = accuracy_score(y_train[:300], prediction_train)
results['acc_test'] = accuracy_score(y_test, prediction_test)
results['f_train'] = fbeta_score(y_train[:300], prediction_train, beta=0.5)
results['f_test'] = fbeta_score(y_test, prediction_test, beta=0.5)
print "{} trained on {} samples".format(learner.__class__, sample_size)
return results
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as Ad
from sklearn.ensemble import GradientBoostingClassifier as GBM
from sklearn.linear_model import SGDClassifier as SGD
clf_A = RF(random_state=10)
clf_B = Ad(random_state=10)
clf_C = GBM(random_state=10)
clf_D = SGD(random_state=10)
samples_1 = int(len(Y_train)/100)
samples_10 = int(len(Y_train)/10)
samples_100 = len(Y_train)
results = {}
for clf in [clf_A, clf_B, clf_C, clf_D]:
clf_name = clf.__class__.__name__
results[clf_name] = {}
for i, samples in enumerate([samples_1,samples_10, samples_100]):
results[clf_name][i] = train_predict(clf, samples, X_train, Y_train, X_test, Y_test)
import matplotlib.patches as mpatches
fig, ax = plt.subplots(2, 3, figsize = (15,7))
# Constants
bar_width = 0.2
colors = ['#2E86C1','#C0392B','#AAB7B8','#000A00']
# Super loop to plot four panels of data
for k, learner in enumerate(results.keys()):
for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']):
for i in np.arange(3):
# Creative plot code
ax[j/3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k],
label = learner)
ax[j/3, j%3].set_xticks([0.45, 1.45, 2.45])
ax[j/3, j%3].set_xticklabels(["1%", "10%", "100%"])
ax[j/3, j%3].set_xlabel("Training Set Size")
ax[j/3, j%3].set_xlim((-0.1, 3.0))
# Add unique y-labels
ax[0, 0].set_ylabel("Time (in seconds)")
ax[0, 1].set_ylabel("Accuracy Score")
ax[0, 2].set_ylabel("F-score")
ax[1, 0].set_ylabel("Time (in seconds)")
ax[1, 1].set_ylabel("Accuracy Score")
ax[1, 2].set_ylabel("F-score")
# Add titles
ax[0, 0].set_title("Model Training")
ax[0, 1].set_title("Accuracy Score on Training Subset")
ax[0, 2].set_title("F-score on Training Subset")
ax[1, 0].set_title("Model Predicting")
ax[1, 1].set_title("Accuracy Score on Testing Set")
ax[1, 2].set_title("F-score on Testing Set")
# Add horizontal lines for naive predictors
ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[0, 2].axhline(y = fscore, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[1, 2].axhline(y = fscore, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
# Set y-limits for score panels
ax[0, 1].set_ylim((0, 1))
ax[0, 2].set_ylim((0, 1))
ax[1, 1].set_ylim((0, 1))
ax[1, 2].set_ylim((0, 1))
# Create patches for the legend
patches = []
for i, learner in enumerate(results.keys()):
patches.append(mpatches.Patch(color = colors[i], label = learner))
plt.legend(handles = patches, bbox_to_anchor = (-.80, 2.73), \
loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large')
# Aesthetics
plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.20)
plt.tight_layout()
plt.show()
Adaboost with 50 weak learner as decision tree with depth as 1. Gradient boosting with 100 trees, max_depth is 3. The training time of GBM grows much faster than Adaboost. The prediction time of Adaboost is higher than other methods.
Choose the best model from the above supervised models. From the above initial model evaluation, we could see that GBM method have high accuracy and F score for both training and test dataset. Even it requires higher training time, but the prediction time of GBM is relatively low.
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
# Initialize the classifier
clf = GBM(random_state=10,)
# Create the parameters to tune
parameters = {'loss':['deviance', 'exponential'],
'learning_rate':[0.1,1 ,1.2],
'n_estimators':[10,90, 200,300],
'max_depth':[1,3, 5, 7]}
# make a f_beta_score
score = make_scorer(fbeta_score,beta = 0.5)
# GridSearch
gridObj = GridSearchCV(clf, param_grid=parameters, scoring=score, cv=5, n_jobs=10)
# for the parameters
grid_fit = gridObj.fit(X_train, Y_train)
best_predictor = grid_fit.best_estimator_
predictions = clf.fit(X_train, Y_train).predict(X_test)
best_predictor = best_predictor.predict(X_test)
print fbeta_score(Y_test, predictions, beta=0.5)
print fbeta_score(Y_test, best_predictor, beta=0.5)