Titanic: Machine Learning from Disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

Getting Started

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
In [2]:
train.head(4)
Out[2]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S

From the data, we can see the features recorded for each passenger on the ship:

  • Survived: the target outcome (0 = did not survive, 1 = survived)
  • PassengerId: the ID of each passenger
  • Pclass: ticket class, a proxy for socio-economic class (1 = upper, 2 = middle, 3 = lower)
  • Name: the name of each passenger
  • Sex: the sex of each passenger
  • Age: the age of each passenger
  • SibSp: number of siblings/spouses aboard the Titanic
  • Parch: number of parents/children aboard the Titanic
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation
In [3]:
train.describe()
Out[3]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

For numeric features, the describe() function reports the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. The count row shows that Age has only 714 non-missing values out of 891, so 177 values are missing.
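describe() only covers numeric columns, so missing values in text columns such as Cabin do not show up in its count row. A minimal sketch of counting missing values per column directly is given below. The helper name `missing_counts` and the small `sample` frame are illustrative assumptions; the frame only reuses a few columns from the four rows printed by train.head(4).

```python
import numpy as np
import pandas as pd

# A few columns from the four rows shown by train.head(4) above.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
})

def missing_counts(df):
    """Return the number of missing values in each column of df."""
    return df.isnull().sum()

counts = missing_counts(sample)
print(counts)
```

Applied to the full training set, `missing_counts(train)['Age']` should agree with the 891 - 714 = 177 derived from the count row.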

In [4]:
print(train.shape)
print("----------------")
print(train.dtypes)
(891, 12)
----------------
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

The shape attribute returns the dimensions of the data frame (rows, columns), and the dtypes attribute returns the data type of each feature.

Visualizing the data helps us understand it better.

In [5]:
sns.countplot(x='Survived', data=train, order=[0,1]);
plt.xlabel('Survived or not (0 for not survived, 1 for survived)');
plt.ylabel('Count of passengers');
In [6]:
## Age figure
age_data_full = train[['Age','Survived']]
# Keep only the rows where Age is known
age_data = age_data_full.loc[age_data_full.Age.notnull(),:]
plt.figure(figsize=[12,12])
sns.distplot(age_data.Age,ax=plt.subplot(311),kde=False);
plt.xlim([0,80])
plt.ylabel('Count of Passengers')
plt.title('Age distribution (177 missing values)');

# Survival rate for each distinct age value
sns.barplot(x='Age',y='Survived',data=age_data,ax=plt.subplot(312),ci=None);
plt.xticks([0,88],[0,80])
# Rows where Age is missing
age_data_missing = age_data_full.loc[age_data_full.Age.isnull(),:]
plt.subplots_adjust(hspace=0.4)
plt.subplot(313)
sns.countplot(x='Survived',data=age_data_missing)
plt.title("Survival counts among passengers with missing Age")
plt.show();

From the figures, we can see that age looks like a good indicator for the target variable: the youngest and oldest passengers have visibly higher survival rates. The survival counts among passengers with a missing age do not provide any useful additional information.
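A bar per distinct age value is noisy, so one way to check the pattern is to group ages into bands and compare the mean of Survived in each band. The sketch below is only an illustration: the function name `survival_rate_by_age_band` and the bin edges are assumptions, and the demo frame reuses the Age and Survived values of the four rows printed by train.head(4).

```python
import pandas as pd

def survival_rate_by_age_band(df, bin_edges):
    """Mean of Survived within each age band; rows with missing Age are dropped."""
    known = df.dropna(subset=['Age'])
    bands = pd.cut(known['Age'], bins=bin_edges)
    return known['Survived'].groupby(bands, observed=False).mean()

# Age and Survived for the four rows shown by train.head(4).
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Survived': [0, 1, 1, 1],
})

# Arbitrary illustrative bin edges: (0, 30] and (30, 80].
rates = survival_rate_by_age_band(sample, [0, 30, 80])
print(rates)
```

On the full training set, a call such as `survival_rate_by_age_band(train, [0, 10, 20, 40, 60, 80])` would give one rate per band; those edges are likewise just one possible choice.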

In [7]:
### Sex figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Sex',data=train,ax=plt.subplot(121));
plt.title('Sex distribution')
plt.subplots_adjust(wspace=0.5)
sns.barplot(x='Sex',y='Survived',ax=plt.subplot(122),
            data=train,ci=None);
plt.title('Survival rate by sex')

From the figure, we can see that female passengers have a markedly higher survival rate than male passengers.
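The bar heights in these plots are group means of Survived, so the same numbers can be read off with a groupby. The following is a small sketch under that assumption; `survival_rate_by` is a hypothetical helper name, and the demo frame reuses the Sex, Pclass and Survived values of the four rows printed by train.head(4), which are far too few to draw conclusions from.

```python
import pandas as pd

# Sex, Pclass and Survived for the four rows shown by train.head(4).
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female'],
    'Pclass': [3, 1, 3, 1],
    'Survived': [0, 1, 1, 1],
})

def survival_rate_by(df, column):
    """Fraction of survivors within each group of the given column."""
    return df.groupby(column)['Survived'].mean()

rates = survival_rate_by(sample, 'Sex')
print(rates)
```

The same helper works for any categorical column, for example `survival_rate_by(train, 'Pclass')` on the full training set.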

In [8]:
### Pclass figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Pclass',data=train,ax=plt.subplot(121))
plt.title('Pclass distribution')
plt.subplots_adjust(wspace=0.8)
sns.barplot(x='Pclass',y='Survived',data=train,ci=None,
           ax=plt.subplot(122))
plt.title('Survival rate by socio-economic class')

Socio-economic class, as captured by Pclass, has a clear effect on survival rate.

In [9]:
### Relatives aboard
fig = plt.figure(figsize=[9,6])
plt.subplots_adjust(hspace=0.8,wspace=0.8)
sns.countplot(x='Parch',data=train,ax=plt.subplot(231))
plt.title('Parch distribution')
sns.countplot(x='SibSp',data=train,ax=plt.subplot(232))
plt.title('SibSp distribution')
plt.subplot(233)
sns.regplot(x='Parch',y='SibSp',data=train, fit_reg=False)
plt.title("Scatter plot of SibSp and Parch.\n"
          "The correlation coefficient is {0:.2f}"
          .format(np.corrcoef(train['Parch'],train['SibSp'])[0,1]))
sns.barplot(x='Parch',y='Survived',data=train,
            ci=None,ax=plt.subplot(234))
plt.title('Survival rate by Parch')
sns.barplot(x='SibSp',y='Survived',data=train,
            ci=None,ax=plt.subplot(235))
plt.title('Survival rate by SibSp')
# Total number of relatives aboard
sns.barplot(x=train['SibSp']+train['Parch'],y=train['Survived'],
            ci=None,ax=plt.subplot(236))
plt.title('Survival rate by SibSp + Parch')
In [10]:
### Cabin figure
# Preprocess Cabin into the number of cabins listed for each passenger
# (0 when the value is missing)
train['Cabin'] = train['Cabin'].apply(
    lambda ele: 0 if pd.isnull(ele) else len(ele.split()))
In [11]:
plt.figure(figsize=[8,3])
plt.subplots_adjust(wspace=0.8)
sns.countplot(x='Cabin',data=train,ax=plt.subplot(121))
plt.title("Cabin number distribution")
sns.barplot(x='Cabin',y='Survived',data=train,
            ci=None,ax=plt.subplot(122))
plt.title("Survival rate by number of cabins")

Passengers with more cabins have a higher survival rate. However, we still need to check for collinearity among the number of cabins, Fare, and Pclass.
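The "number of cabins" used here is the count of whitespace-separated labels in the Cabin field, with 0 for a missing value. A minimal standalone sketch of that transformation follows. The helper name `cabin_count` is an assumption, the first four values come from the rows printed by train.head(4), and the last string is a made-up multi-cabin example added only to show a count above 1.

```python
import numpy as np
import pandas as pd

def cabin_count(value):
    """Number of whitespace-separated cabin labels; 0 when the value is missing."""
    if pd.isnull(value):
        return 0
    return len(str(value).split())

# First four values: Cabin column of train.head(4).
# Last value: hypothetical multi-cabin entry, for illustration only.
cabins = pd.Series([np.nan, 'C85', np.nan, 'C123', 'A1 A2 A3'])
counts = cabins.apply(cabin_count)
print(counts.tolist())
```

Keeping the logic in a named function makes it easy to apply the same transformation to the test set later.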

In [12]:
### Fare figure
plt.figure(figsize=[6,8])
plt.subplots_adjust(hspace=0.4)
sns.distplot(train.Fare,ax=plt.subplot(211))
plt.title("The distribution of Fare")
sns.violinplot(x='Survived',y='Fare',data=train,ax=plt.subplot(212))
plt.title("The Fare distribution versus survived or not")

As expected, passengers who paid higher fares have a higher survival rate.

In [13]:
### Bivariate visualization
# Note: jitter is a stripplot option, not a violinplot option, so it is omitted
plt.figure(figsize=[9,12])
plt.subplots_adjust(hspace=0.4)
sns.violinplot(x='Pclass',y='Fare',data=train,
              ax=plt.subplot(311))
sns.violinplot(x='Pclass',y='Cabin',data=train,
             ax=plt.subplot(312))
sns.violinplot(x='Cabin',y='Fare',data=train,
             ax=plt.subplot(313))
In [14]:
np.corrcoef(train.loc[:,['Pclass','Cabin','Fare']].astype(float).transpose())
Out[14]:
array([[ 1.        , -0.6471164 , -0.54949962],
       [-0.6471164 ,  1.        ,  0.59617121],
       [-0.54949962,  0.59617121,  1.        ]])

These three features are moderately correlated (absolute coefficients between about 0.55 and 0.65), but none of the correlations is very strong.
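np.corrcoef treats each row as a variable, which is why the data frame is transposed above and why the result has no column labels. As a hedged alternative, pandas' DataFrame.corr() computes the same pairwise Pearson coefficients while keeping the labels. The sketch below uses the Pclass, cabin count and Fare values of the four rows printed by train.head(4); four rows only demonstrate the mechanics and say nothing about the real correlations.

```python
import numpy as np
import pandas as pd

# Values from the four rows shown by train.head(4); Cabin is already
# converted to a cabin count (0 when missing), as in the preprocessing cell.
features = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Cabin': [0, 1, 0, 1],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# Pairwise Pearson correlations between columns, with labels preserved.
corr = features.corr()

# np.corrcoef expects one variable per row, so the values are transposed.
corr_np = np.corrcoef(features.values.T)

print(corr)
```

Both calls should give the same matrix; the labelled version is simply easier to read when more features are added.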