Find Charity Donors

The goal of this project is to accurately predict whether an individual makes more than $50,000 a year. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit gauge how large a donation to request, or whether it should reach out at all. The dataset for this project originates from the UCI Machine Learning Repository.

Exploring the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

data = pd.read_csv('../data/census.csv')
data.head(1)
Out[1]:
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov Bachelors 13.0 Never-married Adm-clerical Not-in-family White Male 2174.0 0.0 40.0 United-States <=50K

The first investigation is to determine how many individuals fall into each group: what percentage of these people make more than $50,000?

In [2]:
from __future__ import division
n_records = len(data)

n_greater_50k = len(data[data['income']=='>50K'])
n_at_most_50k = len(data[data['income']=='<=50K'])

print "There are {0} individuals with {1:.1f}% making more than $50,000, \
with {2:.1f}% making at most $50,000.".format(n_records, 100*n_greater_50k/n_records, 100*n_at_most_50k/n_records)
There are 45222 individuals with 24.8% making more than $50,000, with 75.2% making at most $50,000.

Preparing the data

Before data can be used as input for a machine learning algorithm, it often must be cleaned, formatted, and restructured -- this is typically known as preprocessing.

Transforming Skewed Continuous Features

A dataset may sometimes contain at least one feature whose values tend to lie near a single number, but which also has a non-trivial number of values vastly larger or smaller than that number. Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. With the census dataset, two features fit this description: "capital-gain" and "capital-loss".

In [3]:
plt.figure(figsize=[10,4])
plt.subplots_adjust(wspace = 0.4, hspace = 0.2)
plt.subplot(1,2,1)
plt.hist(data['capital-gain'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital gain\' feature distribution')

plt.subplot(1,2,2)
plt.hist(data['capital-loss'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital loss\' feature distribution')

plt.suptitle('Skewed Distribution of Data features')
plt.show()

For highly-skewed feature distributions such as "capital-gain" and "capital-loss", it is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. A logarithmic transformation significantly reduces the range of values caused by outliers. Since the logarithm of 0 is undefined, the values are translated by 1 before the logarithm is applied, as in np.log(x + 1) below.

In [4]:
# split the data into feature space and target 
income_raw = data['income']
features_raw = data.drop('income', axis=1)

skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x+1))
In [5]:
plt.figure(figsize=[10,4])
plt.subplots_adjust(wspace = 0.4, hspace = 0.2)
plt.subplot(1,2,1)
plt.hist(features_raw['capital-gain'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital gain\' feature distribution')

plt.subplot(1,2,2)
plt.hist(features_raw['capital-loss'],bins=25)
plt.ylim([0,2000])
plt.yticks(np.linspace(0,2000,9),('0','250','500','750','1000','1250','1500','1750',">2000"))
plt.title('\'Capital loss\' feature distribution')

plt.suptitle('Skewed Distribution of Data features after logarithmic transformation')
plt.show()

Normalizing Numerical Features

In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution; however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning. We will use sklearn.preprocessing.MinMaxScaler for this:

In [6]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
numerical = ['age', 'education-num','capital-gain','capital-loss','hours-per-week']

features_raw[numerical] = scaler.fit_transform(features_raw[numerical])
In [7]:
features_raw.head(1)
Out[7]:
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 0.30137 State-gov Bachelors 0.8 Never-married Adm-clerical Not-in-family White Male 0.667492 0.0 0.397959 United-States

One-hot encoding

From the table above, we can see there are several features for each record that are non-numerical. Typically, learning algorithms expect their input to be numerical, which requires that non-numerical features be converted. One popular way to convert non-numerical variables is the one-hot encoding scheme, which creates a "dummy" variable for each possible category of each non-numeric feature.

Additionally, we need to convert the non-numerical target label to numerical values for the learning algorithm to work.

In [8]:
# One-hot encode the raw feature data using pandas.get_dummies()
features = pd.get_dummies(features_raw,columns=['workclass','education_level','marital-status','occupation',
                                               'relationship','race','sex','native-country'])
income = income_raw.apply(lambda x: 1 if x == ">50K" else 0)

# Print the number of features after one-hot encoding
encoded = list(features.columns)
print "{} total features after one-hot encoding.".format(len(encoded))
103 total features after one-hot encoding.
In [9]:
features.head(1)
Out[9]:
age education-num capital-gain capital-loss hours-per-week workclass_ Federal-gov workclass_ Local-gov workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.30137 0.8 0.667492 0.0 0.397959 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

1 rows × 103 columns

Shuffle and Split Data

As always, we will split the data into training and test sets: 80% of the data will be used for training and 20% for testing.

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features, income, test_size = 0.2, random_state = 10)

print "Training set has {} samples".format(X_train.shape[0])
print "Test set has {} samples".format(X_test.shape[0])
Training set has 36177 samples
Test set has 9045 samples

Evaluating Model performance

In this section, we will investigate multiple algorithms, and determine which is best at modelling the data.

Metrics

Identifying someone who does not make more than \$50,000 as someone who does would be detrimental, since we are looking to find individuals willing to donate. Therefore, a model's ability to precisely predict those that make more than $50,000 is more important than the model's ability to recall those individuals. We can use the F-beta score as a metric that considers both precision and recall:

\begin{equation} F_{\beta} = (1+\beta^2)\times \frac{precision\times recall}{(\beta^2 \times precision) + recall} \end{equation}

In particular, when $\beta = 0.5$, more emphasis is placed on precision.
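
As a quick illustration of this emphasis (a toy example, not census data), consider a conservative predictor with perfect precision but only 50% recall; beta = 0.5 scores it much closer to its precision than beta = 2 would:

from sklearn.metrics import fbeta_score

# Toy labels: the predictor flags only sure bets -- precision 1.0, recall 0.5
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

print fbeta_score(y_true, y_pred, beta=0.5)  # 0.833..., pulled toward precision
print fbeta_score(y_true, y_pred, beta=2)    # 0.555..., pulled toward recall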

Model 1 - Naive Predictor Performance

If we choose a model that always predicts that an individual makes more than \$50,000, what would that model's accuracy and F-score be on the dataset?

In [11]:
# Calculate the accuracy 
accuracy = n_greater_50k / n_records

# Calculate F-score using the formula above for beta = 0.5
# Every individual is predicted as '>50K', so precision = TP/(TP+FP) and recall = TP/(TP+FN) = 1
precision = n_greater_50k / n_records
recall = n_greater_50k / n_greater_50k

fscore = (1+0.5**2)*(precision*recall)/(0.5**2*precision+recall)
print "Naive predictor: [Accuracy scoare {:.4f}, F-score: {:.4f}.]".format(accuracy, fscore)
Naive predictor: [Accuracy scoare 0.2478, F-score: 0.2917.]

Supervised Learning Models

The following supervised learning models are currently available in scikit-learn:

  • Gaussian Naive Bayes
  • Decision Trees
  • Ensemble methods (Bagging, Adaboost, Random forest, Gradient Boosting)
  • K-Nearest Neighbors
  • Stochastic Gradient Descent Classifier (SGDC)
  • Support Vector Machine (SVM)
  • Logistic Regression

Model comparison

Gaussian Naive Bayes

  • Real world application: document classification and spam filtering
  • Strength: super simple; you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so it needs less training data. Even if the NB assumption does not hold, an NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs very well.
  • Weakness: its main weakness is that it cannot learn interactions between features. Although NB is known as a decent classifier, it is known to be a bad estimator, so the probability outputs are not to be taken too seriously.
  • This may not be a good fit for the data, since there are many categorical variables in the dataset (BernoulliNB could handle binary features), and the mix of categorical and continuous variables would be a problem for the GaussianNB model in scikit-learn (a quick sketch follows).
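
For illustration only, a minimal sketch of what a GaussianNB baseline would look like on the prepared features (using the X_train/Y_train split defined above; this model is not part of the evaluation below):

from sklearn.naive_bayes import GaussianNB

# Quick baseline fit; GaussianNB models each feature as conditionally
# Gaussian given the class, which is a poor match for one-hot dummy columns
nb = GaussianNB()
nb.fit(X_train, Y_train)
print "GaussianNB test accuracy: {:.4f}".format(nb.score(X_test, Y_test))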

Decision Tree

  • Real world applications: credit scoring, medical diagnosis, and other problems where interpretable decision rules are valuable
  • Strength: simple to understand and to interpret, and trees can be visualized. They easily handle feature interactions and are non-parametric. They require little data preparation and can handle both numerical and categorical data, as well as multiple-output problems. They use a white-box model: if a given situation is observable in a model, the explanation for the condition is easily expressed in boolean logic. It is possible to validate a model using statistical tests. They perform well even if their assumptions are somewhat violated by the true model from which the data were generated.
  • Weakness: easy to overfit. Decision trees can be unstable because small variations in the data might result in completely different trees being generated. The training algorithm cannot guarantee returning the globally optimal decision tree. Decision tree learners create biased trees if some classes dominate, so it is recommended to balance the dataset before fitting (bagging and boosting perform better with unbalanced datasets).
  • This may not be a good choice for the data: ensembles of decision trees are preferable, and the data is not balanced (a sketch of one mitigation follows below).
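
As a side note, a minimal sketch of one way to mitigate the class imbalance for a single tree (class_weight is a standard scikit-learn parameter; the setting here is illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# 'balanced' reweights classes inversely proportional to their frequencies,
# counteracting the bias toward the dominant <=50K class
tree = DecisionTreeClassifier(class_weight='balanced', random_state=10)
tree.fit(X_train, Y_train)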

Random Forest

  • Real world application: used for the same applications as decision trees
  • Strength: quite fast to train, and able to work with unbalanced datasets. Provides estimates of which variables are important in the classification.
  • Weakness: as regressors, they cannot predict beyond the range of the training data. They may overfit datasets that are particularly noisy. They also take time to make predictions.
  • This could be a good option to try.

AdaBoost

  • Real world application: Basketball player recognition
  • Strength: can be used with many different classifiers. Improves classification accuracy. Simple to implement. Not particularly prone to overfitting.
  • Weakness: may settle on a suboptimal solution. Sensitive to noisy data and outliers.
  • This is selected as an option.

Gradient Boosting

  • Real world application: many winning solutions on Kaggle
  • Strength: boosting methods have three main parameters to tune: the shrinkage (learning rate), the depth of each tree, and the number of trees. With well-chosen tuning parameters, they generally give somewhat better results than random forests (see the sketch after this list).
  • Weakness: you cannot simply maximize the number of trees; with too many trees, GBM can overfit.
  • This is selected as an option.
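
A minimal sketch of the three knobs named in the strengths above (the values are arbitrary placeholders, not tuned ones):

from sklearn.ensemble import GradientBoostingClassifier

# The three main tuning knobs for boosting
gbm = GradientBoostingClassifier(learning_rate=0.1,  # shrinkage
                                 max_depth=3,        # depth of each tree
                                 n_estimators=100)   # number of trees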

K-nearest neighbors

  • Real world application: a simple baseline method for classification and regression tasks
  • Strength: K-NN is among the simplest of all machine learning algorithms. Robust to noisy training data. Effective if the training data is large.
  • Weakness: the value of the parameter K and the type of distance to use must be chosen. Computational cost is quite high.
  • This would not be a good option, since the distance between data points is not easily defined with mixed feature types.

Stochastic Gradient descent classifier

  • Real world application: it is a popular algorithm for training a wide range of models in machine learning.
  • Strength: Efficiency. Ease of implementation.
  • Weakness: SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations. SGD is sensitive to feature scaling.
  • This could be selected as an option.

Support vector machine

  • Real world application: text classification, image recognition, and bioinformatics
  • Strength: effective in high dimensional spaces. Still effective in cases where number of dimensions is greater than the number of samples. Uses a subset of training points in the decision function (called support vectors), so it is memory efficient. Versatile.
  • Weakness: If the number of features is much greater than the number of samples, the method is likely to give poor performance. SVMs do not directly provide probability estimates.

Implementation - creating a training and predicting pipeline

To properly evaluate the performance of each model, it's important to create a training and predicting pipeline that allows you to quickly and effectively train models using various sizes of training data and perform predictions on the testing data.

  • Use fbeta_score and accuracy_score from sklearn.metrics.
  • Fit the learner to the sampled training data and record the training time.
  • Perform predictions on the test data (recording the total prediction time).
  • Calculate the accuracy score for both the training subset and the test set.
  • Calculate the F-score for both the training subset and the test set.
In [12]:
from sklearn.metrics import fbeta_score, accuracy_score
from time import time
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    results = {}
    
    start = time()
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()
    
    results['train_time'] = end - start
    
    # Get predictions on the test set, then on the first 300 training samples
    start = time()
    prediction_test = learner.predict(X_test)
    prediction_train= learner.predict(X_train[:300])
    end = time()
    
    results['pred_time'] = end - start
    
    results['acc_train'] = accuracy_score(y_train[:300], prediction_train)
    results['acc_test'] = accuracy_score(y_test, prediction_test)
    
    results['f_train'] = fbeta_score(y_train[:300], prediction_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, prediction_test, beta=0.5)
    print "{} trained on {} samples".format(learner.__class__, sample_size)
    return results

Implementation: Initial Model Evaluation

In [13]:
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as Ad
from sklearn.ensemble import GradientBoostingClassifier as GBM
from sklearn.linear_model import SGDClassifier as SGD

clf_A = RF(random_state=10)
clf_B = Ad(random_state=10)
clf_C = GBM(random_state=10)
clf_D = SGD(random_state=10)

samples_1 = int(len(Y_train)/100)
samples_10 = int(len(Y_train)/10)
samples_100 = len(Y_train)

results = {}
for clf in [clf_A, clf_B, clf_C, clf_D]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1,samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, Y_train, X_test, Y_test)
<class 'sklearn.ensemble.forest.RandomForestClassifier'> trained on 361 samples
<class 'sklearn.ensemble.forest.RandomForestClassifier'> trained on 3617 samples
<class 'sklearn.ensemble.forest.RandomForestClassifier'> trained on 36177 samples
<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> trained on 361 samples
<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> trained on 3617 samples
<class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'> trained on 36177 samples
<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> trained on 361 samples
<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> trained on 3617 samples
<class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'> trained on 36177 samples
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> trained on 361 samples
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> trained on 3617 samples
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> trained on 36177 samples
In [14]:
import matplotlib.patches as mpatches
fig, ax = plt.subplots(2, 3, figsize = (15,7))

# Constants
bar_width = 0.2
colors = ['#2E86C1','#C0392B','#AAB7B8','#000A00']

# Super loop to plot four panels of data
for k, learner in enumerate(results.keys()):
    for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']):
        for i in np.arange(3):

            # Creative plot code
            ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k], 
                             label = learner)
            ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45])
            ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"])
            ax[j//3, j%3].set_xlabel("Training Set Size")
            ax[j//3, j%3].set_xlim((-0.1, 3.0))

# Add unique y-labels
ax[0, 0].set_ylabel("Time (in seconds)")
ax[0, 1].set_ylabel("Accuracy Score")
ax[0, 2].set_ylabel("F-score")
ax[1, 0].set_ylabel("Time (in seconds)")
ax[1, 1].set_ylabel("Accuracy Score")
ax[1, 2].set_ylabel("F-score")

# Add titles
ax[0, 0].set_title("Model Training")
ax[0, 1].set_title("Accuracy Score on Training Subset")
ax[0, 2].set_title("F-score on Training Subset")
ax[1, 0].set_title("Model Predicting")
ax[1, 1].set_title("Accuracy Score on Testing Set")
ax[1, 2].set_title("F-score on Testing Set")

# Add horizontal lines for naive predictors
ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[0, 2].axhline(y = fscore, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
ax[1, 2].axhline(y = fscore, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')

# Set y-limits for score panels
ax[0, 1].set_ylim((0, 1))
ax[0, 2].set_ylim((0, 1))
ax[1, 1].set_ylim((0, 1))
ax[1, 2].set_ylim((0, 1))

# Create patches for the legend
patches = []
for i, learner in enumerate(results.keys()):
    patches.append(mpatches.Patch(color = colors[i], label = learner))
plt.legend(handles = patches, bbox_to_anchor = (-.80, 2.73), \
           loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large')

# Aesthetics
plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.20)
plt.tight_layout()
plt.show()

By default, AdaBoost uses 50 weak learners (decision trees of depth 1), while gradient boosting uses 100 trees with a max_depth of 3. The training time of GBM grows much faster than that of AdaBoost, while the prediction time of AdaBoost is higher than that of the other methods.
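
These defaults can be verified directly on the estimator classes imported earlier (a quick check; attribute names are as in the scikit-learn version used in this notebook):

# Default configurations reported by the estimator classes
print Ad().n_estimators                    # 50 weak learners
print Ad().base_estimator                  # None -> a depth-1 decision stump
print GBM().n_estimators, GBM().max_depth  # 100 trees of depth 3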

Improving results

We now choose the best model from the supervised models above. From the initial model evaluation, we can see that the GBM method has high accuracy and F-scores on both the training and test sets. Although it requires more training time, its prediction time is relatively low.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = GBM(random_state=10)

# Create the parameters to tune
parameters = {'loss':['deviance', 'exponential'],
              'learning_rate':[0.1,1 ,1.2],
              'n_estimators':[10,90, 200,300],
              'max_depth':[1,3, 5, 7]}

# Make an fbeta scorer with beta = 0.5
score = make_scorer(fbeta_score,beta = 0.5)

# GridSearch 
gridObj = GridSearchCV(clf, param_grid=parameters, scoring=score, cv=5, n_jobs=10)

# Fit the grid search to find the best parameters
grid_fit = gridObj.fit(X_train, Y_train) 
In [16]:
best_clf = grid_fit.best_estimator_
predictions = clf.fit(X_train, Y_train).predict(X_test)  # unoptimized model
best_predictions = best_clf.predict(X_test)               # tuned model
In [17]:
print fbeta_score(Y_test, predictions, beta=0.5)
print fbeta_score(Y_test, best_predictions, beta=0.5)
0.745530012771
0.75184501845
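
To see which parameter combination the grid search selected, the fitted search object can be inspected (output omitted here):

# The winning hyperparameter combination and its cross-validated F-score
print grid_fit.best_params_
print grid_fit.best_score_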